Chinese character information technology, or Chinese character IT for short, is the information technology for computer processing of Chinese characters. While the English writing system uses a few dozen different characters, the Chinese language needs a much larger character set. There are over ten thousand characters in the ''Xinhua Dictionary''. In the Unicode multilingual character set of 149,813 characters, 98,682 (about two-thirds) are Chinese. Computer processing of Chinese characters is therefore considerably harder than that of most other languages, and raises special issues in the technology of computer input, internal encoding and output of Chinese characters.

Character input

Computer input of Chinese characters is considerably harder than input of English. English is written with 26 letters and a handful of other characters, and each character is assigned to a key on the keyboard. Chinese could be input in a similar way, but that would require a huge keyboard with thousands of keys, and searching for a character on it would be a daunting job. People did try to 'shrink' the Chinese keyboard by putting multiple characters on one key. That turned the original one-step input procedure into two steps for the writer: (1) pressing the key for the character group containing the target character, and (2) selecting the target character in the group. The resulting keyboard remained clumsy: the more characters are put on one key, the bigger the key must be to keep the characters recognizable, and selecting a character from a large group is difficult. Additionally, it is not easy to group the characters evenly in a reasonable and easy-to-learn way. Another drawback of a Chinese keyboard for direct whole-character input is its inconsistency with English input.

An alternative is to encode each Chinese character in English characters, enabling Chinese input on an English keyboard. This method has become predominant for Chinese computer input. The software of an encoding input method includes a character-code table. When an ASCII input code is typed on the English keyboard, the software searches the table for matching Chinese characters. If multiple characters share the same code, they are presented to the user for selection. To make the input method easy to learn, encoding must be based on distinctive features in the forms, sounds or meanings of Chinese characters. Because the meanings of characters tend to be more abstract and complicated, input encoding is normally based on sound or form.
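The table lookup described above can be sketched in a few lines of Python. This is only an illustration: the table `CODE_TABLE`, its entries and the function names `lookup` and `choose` are hypothetical and are not taken from any real input method.

```python
# Hypothetical character-code table: input code -> candidate characters.
# The entries are a tiny illustrative sample, not data from a real input method.
CODE_TABLE = {
    "xiang": ["香", "想", "向"],
    "gang": ["港", "刚", "钢"],
    "fu": ["福"],
}


def lookup(code):
    """Return the candidate characters for an ASCII input code (empty if none)."""
    return CODE_TABLE.get(code.lower(), [])


def choose(code, index=0):
    """Return the candidate at position `index`, as if the user had selected it."""
    candidates = lookup(code)
    if not candidates:
        return None
    return candidates[index]


if __name__ == "__main__":
    print(lookup("gang"))    # several characters share this code
    print(choose("xiang"))   # the user picks the first candidate
```

A code with a single candidate could be committed without a selection step, which is why a low degree of duplicate encoding matters for input speed.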

Sound-based encodings

Sound-based encoding is normally based on an existing Latin character scheme for Chinese phonetics, such as pinyin for Putonghua, and Jyutping for Cantonese. The input code of a Chinese character is its pinyin letter string followed by an optional number representing the tone. For example, the Putonghua pinyin input code of 香港 (Hong Kong) is ''xianggang'' or ''xiang1gang3'', and the Cantonese Jyutping code is ''hoenggong'' or ''hoeng1gong2'', all of which can be easily input via an English keyboard. In Putonghua pinyin, there are two letters not appearing on the English keyboard: ê and ü. According to the national standard, ê should be represented by 'ea', and ü by 'v' in the pinyin input code. In some Chinese input software ê is also represented as 'e^', and ü as 'u:' or 'uu'. Popular sound-based input methods in China include Microsoft Pinyin, Sogou Pinyin, Google Pinyin and Jyutping on the mainland and in Hong Kong, and bopomofo in Taiwan.

Sound-based encoding has a number of advantages: (1) it is easy to learn, because most Chinese writers already have a good command of Putonghua and pinyin; (2) it is consistent with Chinese language learning; (3) it allows simplified and traditional Chinese characters to be input in a similar way; (4) it allows Chinese and English to be written on the same keyboard.

The shortcoming of sound-based encoding is its high degree of duplicate encoding, with homophonous Chinese characters sharing the same code. A Chinese character is normally pronounced with one syllable. Putonghua has only about 400 different syllables without considering tones, or approximately 1,200 syllables when tones are considered. On the other hand, there are tens of thousands of Chinese characters. That means that, on average, each syllable has to cover over 10 characters. This problem can be largely solved by inputting Chinese word by word instead of character by character, because most words in modern Chinese consist of more than one character and duplicate encoding is much less frequent at the word level. For example, the pinyin of 香港 (Hong Kong) is unique to the word, while each of the characters 香 and 港 shares its pronunciation with many other characters.

Another limitation of sound-based Chinese input is that the user must know the pronunciation of a Chinese character before inputting it into the computer. This issue can be solved by form-based encoding.
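The effect of optional tone digits and of word-level input on duplicate codes can be illustrated with a small sketch. The lexicon `LEXICON` and the functions `strip_tones` and `candidates` are hypothetical names introduced here, and the entries are a tiny sample rather than real input method data.

```python
import re

# Hypothetical mini-lexicon keyed by toneless pinyin. Each entry pairs a
# character or word with its toned input code. Illustrative sample only.
LEXICON = {
    "xiang": [("香", "xiang1"), ("想", "xiang3"), ("向", "xiang4")],
    "gang": [("港", "gang3"), ("刚", "gang1"), ("钢", "gang1")],
    "xianggang": [("香港", "xiang1gang3")],
}


def strip_tones(code):
    """Remove tone digits 1-5 from a pinyin input code."""
    return re.sub(r"[1-5]", "", code)


def candidates(code):
    """Return entries matching the code; typed tone digits must also match."""
    toneless = strip_tones(code)
    entries = LEXICON.get(toneless, [])
    if code == toneless:
        # No tone digits typed: every homophone is a candidate.
        return [text for text, _ in entries]
    return [text for text, toned in entries if toned == code]


if __name__ == "__main__":
    print(candidates("gang"))       # all three homophones
    print(candidates("gang1"))      # tone digit narrows the list
    print(candidates("xianggang"))  # the word-level code has one candidate
```

In this sample, typing a tone digit shortens the candidate list, and typing the whole word removes the ambiguity altogether.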

Form-based encodings

A Chinese character can alternatively be input according to its form (or shape) and structure. Most Chinese characters can be divided into a sequence of components, each of which is in turn composed of a sequence of strokes in writing order. The character 福 ('good fortune', 'happiness'), for example, can be decomposed in this way. There are a few hundred basic components, far fewer than the number of characters. By representing each component with an English letter and putting the letters in the writing order of the character, the creator of an input method obtains a letter string ready to be used as an input code on the English keyboard. The creator can also design a rule to select representative letters from the string if it is too long. For example, in the Cangjie input method, the character 疆 ('border') is encoded as "NGMWM", corresponding to the components "弓土一田一", with some components omitted.

Stroke-based coding is simpler than component-based coding, but the codes tend to be longer. There are approximately 30 to 40 distinctive strokes in Chinese characters. They are usually classified into five categories, heng (一), shu (丨), pie (丿), dian (丶) and zhe (𠃍), for dictionary lookup and Chinese input on a mobile phone. For Chinese input with an ASCII keyboard, two strokes can be combined to form 5*5=25 different pairs for mapping to the English letters. For example, in the input method ZYQ, the sequence of stroke pairs '一一, 一丨, 一丿, ..., 𠃍丿, 𠃍丶, 𠃍𠃍' is represented by 'a, b, c, ..., w, x, y' respectively. Popular form-based encoding methods include Wubi on the mainland and Cangjie in Taiwan and Hong Kong.

The pros and cons of form-based input methods are complementary to those of sound-based methods. The major advantage of form-based methods lies in their low degree of duplicate encoding, which enables high-speed input of Chinese characters. The major shortcoming is that they are difficult to learn. Students normally have to remember over one hundred components and their corresponding English letters. In addition, they have to learn the complicated rules for breaking a character into a sequence of components and making a selection among them.
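The stroke-pair mapping described above for ZYQ (25 pairs of the five stroke categories mapped to 'a' through 'y') can be reproduced with simple index arithmetic. The sketch below assumes only what the text states, namely the category order heng, shu, pie, dian, zhe and the pair order '一一' to '𠃍𠃍'. The function `encode_pairs` and its handling of input are hypothetical and do not describe how ZYQ treats whole characters.

```python
import string

# The five stroke categories in the order given in the text:
# heng, shu, pie, dian, zhe.
STROKES = ["一", "丨", "丿", "丶", "𠃍"]

# Pair (i, j) maps to the letter at index 5*i + j, so the 25 pairs use 'a'..'y'.
PAIR_TO_LETTER = {
    first + second: string.ascii_lowercase[5 * i + j]
    for i, first in enumerate(STROKES)
    for j, second in enumerate(STROKES)
}


def encode_pairs(strokes):
    """Encode a sequence of stroke categories two at a time.

    Assumption for this sketch: the sequence has an even number of strokes.
    """
    if len(strokes) % 2 != 0:
        raise ValueError("expected an even number of strokes")
    return "".join(
        PAIR_TO_LETTER[strokes[k] + strokes[k + 1]]
        for k in range(0, len(strokes), 2)
    )


if __name__ == "__main__":
    print(PAIR_TO_LETTER["一一"], PAIR_TO_LETTER["𠃍𠃍"])
    print(encode_pairs(["一", "丨", "𠃍", "𠃍"]))
```

Because each letter stands for two strokes, a code built this way is about half as long as a code with one symbol per stroke.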

Optical character recognition

Chinese characters can also be input into the computer by optical character recognition (OCR), handwriting recognition and speech recognition, based on technology similar to that used for English. Compared with English, Chinese OCR and handwriting recognition are more difficult, because there are thousands of commonly used characters instead of 26 letters. Generally speaking, recognition of printed characters is more accurate than recognition of handwritten ones, because printed forms are more standardized. There are OCR tools for different fonts, including the popular Song, Kai and Hei. Compared with offline handwriting recognition, online handwriting recognition is more efficient, because the computer 'sees' not only the written character but also the process of writing it.

Speech recognition

Speech recognition converts a continuous speech signal into a sequence of words. There are two problems: the variation in pronunciation of words by different speakers, and the existence of homophones such as 'pair', 'pear' and 'pare' in English, and 攻势, 公式, 公示 (gong1shi4) in Chinese. Speech recognition relies on corpus-based statistical methods and linguistic rules. A helpful feature of Chinese is that each character is pronounced with one syllable. Both Chinese character recognition and speech recognition have reached application level. However, neither can guarantee 100% correctness without human proofreading or online character selection.

Intelligent input engines

The most important feature of intelligent input is the application of contextual constraints to the selection of candidate characters. For example, in Microsoft Pinyin, when the user types the input code "daxuejiaoshou", they get 大学教授 (university professor), and when they type "daxuepiaopiao", the computer suggests 大雪飘飘 (heavy snow flying). Though the non-diacritical pinyin letters of 大学 and 大雪 are both "daxue", the computer can make a reasonable selection based on the subsequent words. Intelligent Chinese input also makes use of corpus information and linguistic rules. The computer's selection among ambiguous Chinese characters is not always correct, and further improvement is required.
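One simple way to picture contextual selection is to score candidate pairs by how often they occur together. The sketch below is a toy model, not a description of how Microsoft Pinyin works: the counts in `PAIR_COUNTS` are invented for illustration, the candidate lists are a hypothetical sample, and the input is assumed to be already segmented into two pinyin codes.

```python
# Hypothetical co-occurrence counts between adjacent words. A real engine
# would estimate such statistics from a large corpus; these numbers are made up.
PAIR_COUNTS = {
    ("大学", "教授"): 120,
    ("大雪", "教授"): 1,
    ("大学", "飘飘"): 1,
    ("大雪", "飘飘"): 45,
}

# Hypothetical candidates for each toneless pinyin segment.
CANDIDATES = {
    "daxue": ["大学", "大雪"],
    "jiaoshou": ["教授"],
    "piaopiao": ["飘飘"],
}


def best_sequence(first_code, second_code):
    """Pick the pair of candidates with the highest co-occurrence count."""
    pairs = [
        (a, b)
        for a in CANDIDATES[first_code]
        for b in CANDIDATES[second_code]
    ]
    return max(pairs, key=lambda pair: PAIR_COUNTS.get(pair, 0))


if __name__ == "__main__":
    print("".join(best_sequence("daxue", "jiaoshou")))
    print("".join(best_sequence("daxue", "piaopiao")))
```

The same code "daxue" resolves to different words depending on what follows it, which is the behaviour the example in the text describes.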

Other input

In the Chinese writing system, there are graphemes other than complete Chinese characters, such as punctuation marks (e.g. '。', '、' and '《》'), strokes (e.g. '丿', '𠃍' and '乚'), radicals (e.g. '氵', '宀' and '刂'), and letters used for romanization, like the vowel letters with diacritics used in pinyin and the Yale romanization of Cantonese (e.g. 'ā', 'á', 'ǎ', 'à').

Facilities available on Microsoft Windows, Office and the web make it possible to input almost all of these Chinese auxiliary characters. They range from punctuation input in general Chinese input methods, to inputting diacritical pinyin with soft keyboards, to inputting strokes and radicals from the Unicode website and by Unicode-character conversion, as well as special tools on the Web for inputting pinyin and other characters. More information on non-logogram input can be found in a paper that includes a list of 280 non-ASCII non-logograms, each annotated with its Unicode code point and an input code of the author's design. It is also possible to input a character in Microsoft Word by typing its Unicode code point and pressing Alt+X.
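Conversion between a Unicode code point and its character, similar in spirit to the Alt+X facility mentioned above, can be sketched with Python's built-in `chr` and `ord`. The helper names `from_code_point`, `to_code_point` and `add_tone` are hypothetical. `add_tone` shows one possible way to build pinyin vowels with tone marks from combining diacritics, and is not how any particular input tool does it.

```python
import unicodedata


def from_code_point(hex_code):
    """Return the character for a hexadecimal Unicode code point such as '9999'."""
    return chr(int(hex_code, 16))


def to_code_point(char):
    """Return the code point of a single character as uppercase hexadecimal."""
    return format(ord(char), "04X")


def add_tone(vowel, tone):
    """Attach a pinyin tone mark (tones 1-4) to a vowel via combining diacritics."""
    marks = {
        1: "\u0304",  # combining macron
        2: "\u0301",  # combining acute accent
        3: "\u030C",  # combining caron
        4: "\u0300",  # combining grave accent
    }
    # NFC normalization merges the vowel and the mark into one precomposed letter.
    return unicodedata.normalize("NFC", vowel + marks[tone])


if __name__ == "__main__":
    print(from_code_point("9999"), to_code_point("港"))
    print([add_tone("a", t) for t in (1, 2, 3, 4)])
```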

Chinese character encoding for information interchange

Inside the computer each character is represented by an internal code. When a character is sent between two machines, it is in information interchange code. Nowadays, information interchange codes, such as ASCII and Unicode, are often directly employed as internal codes. The following sections will introduce the most important encoding standards used in Chinese information technology, including GB, Big5 and Unicode.

GB

GB stands for Guobiao, short for "Guojia Biaozhun" (国家标准, or 'national standard') in Putonghua, and is the prefix for reference numbers of official standards issued by the People's Republic of China. The first GB Chinese character encoding standard is GB2312, which was released in 1980. It includes 6,763 Chinese characters, with 3,755 frequently used ones sorted by pinyin, and the rest by radicals (indexing components). GB2312 was designed for simplified Chinese characters. Traditional characters which have been simplified are not covered. The code of a character is a two-byte number, usually written in hexadecimal. For instance, the GB codes of 香港 (Hong Kong) are CFE3 and B8DB respectively. GB2312 is still in use on some computers and the WWW, though newer versions with extended character sets, such as GB13000.1 and GB18030, have been released. The latest version of GB encoding is GB18030. GB18030 supports both simplified and traditional Chinese characters, and is consistent with Unicode's character set.
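The two-byte GB codes quoted above can be checked with Python's standard codecs. This is a minimal sketch that assumes the codec names `gb2312` and `gb18030` shipped with CPython. The helper names `gb_codes` and `in_gb2312` are introduced here for illustration.

```python
def gb_codes(text, encoding="gb2312"):
    """Return the code of each character as an uppercase hexadecimal string."""
    return [char.encode(encoding).hex().upper() for char in text]


def in_gb2312(char):
    """Report whether a character can be encoded in GB2312."""
    try:
        char.encode("gb2312")
        return True
    except UnicodeEncodeError:
        return False


if __name__ == "__main__":
    print(gb_codes("香港"))                  # codes of the two characters
    print(in_gb2312("龙"), in_gb2312("龍"))  # simplified vs. traditional form
    print(gb_codes("龍", encoding="gb18030"))
```

The traditional form 龍 is rejected by the `gb2312` codec but accepted by `gb18030`, which matches the statement that GB18030 covers both simplified and traditional characters.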

Big5

Big5 encoding was designed by five big IT companies in Taiwan in the early 1980s, and has been the de facto standard for representing traditional Chinese in computers ever since. Big5 is widely used in Taiwan, Hong Kong and Macau. The original Big5 standard included 13,053 Chinese characters, with none of the simplified characters of the Mainland. Each character is encoded with a two-byte code, usually written in hexadecimal, for example, 香 (ADBB), 港 (B4E4) and 龍 (C073). Chinese characters in the Big5 character set are arranged in radical order. Extended versions of Big5 include Big-5E and Big5-2003, which include some simplified characters and Hong Kong Cantonese characters.

Unicode

Unicode is the most influential international standard for multilingual character encoding. It is consistent with (or virtually equivalent to) the standard ISO/IEC 10646. In its fixed-width form (UTF-32), Unicode represents each character with a 4-byte code, providing a huge encoding space to cover all characters of all languages in the world. The Basic Multilingual Plane (BMP) is the 2-byte core of Unicode, with 2^16 = 65,536 code points for important characters of many languages. There are 27,522 characters in the CJKV (China, Japan, Korea and Vietnam) Ideographs Area, including all the simplified and traditional Chinese characters in GB2312 and Big5. In Unicode 15.1, there is a multilingual character set of 149,813 characters, among which 98,682, about two-thirds, are Chinese characters, sorted by Kangxi radicals. Even very rarely used characters are available. The following are some example characters with their Unicode code points in brackets: H (0048), K (004B), 香 (9999), 港 (6E2F), 龍 (9F8D), 龙 (9F99), 龖 (9F96), 龘 (9F98), 𪚥 (2A6A5).

All the 5,009 characters of the Hong Kong Supplementary Character Set (HKSCS) are included in Unicode. HKSCS was developed by the Hong Kong government as a collection of locally specific Chinese characters not available on the computer in the early days, for instance 咗 (already), 嘢 (thing), 脷 (tongue), and 曱甴 (cockroach).

Because GB, Big5 and Unicode are used concurrently for Chinese encoding, a computer sometimes interprets a text with an encoding standard different from the one it was written in. The text is then displayed with wrong characters, a phenomenon called "luànmǎ" (乱码, 'garbled code'), which occasionally happens on the Web or in emails. This problem is often solved by manual selection of the encoding or character set (as in Web browsers) or by code conversion beforehand. Unicode is becoming more and more popular. It is reported that UTF-8 (Unicode) is used by 98.1% of all websites. It is widely believed that Unicode will ultimately replace all other information interchange codes and internal codes, and that garbled text will then disappear.
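The code points listed above, and the garbling that results from reading bytes with the wrong standard, can both be reproduced with Python's standard codecs. The sketch assumes the CPython codec names `utf-8`, `utf-16-be`, `gb2312` and `big5`. The helper names `describe` and `misread` are hypothetical.

```python
def describe(char):
    """Return the code point of a character and its size in UTF-8 and UTF-16."""
    return {
        "code_point": format(ord(char), "04X"),
        "utf8_bytes": len(char.encode("utf-8")),
        "utf16_bytes": len(char.encode("utf-16-be")),
    }


def misread(text, written_as, read_as):
    """Encode text in one standard and decode the bytes as another.

    Undecodable bytes are replaced rather than raising an error, which is
    roughly what a reader sees as garbled text.
    """
    return text.encode(written_as).decode(read_as, errors="replace")


if __name__ == "__main__":
    print(describe("香"))            # a BMP character
    print(describe(chr(0x2A6A5)))    # a character outside the BMP
    print(misread("香港", "gb2312", "big5"))    # garbled
    print(misread("香港", "gb2312", "gb2312"))  # intact
```

Characters outside the BMP, such as the one at code point 2A6A5, need four bytes in both UTF-8 and UTF-16, while BMP characters such as 香 need two bytes in UTF-16 and three in UTF-8.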

Output

Typefaces

Like English and other languages, Chinese characters are output on printers and screens in different fonts and styles. The most popular Chinese fonts are the Song (宋体), Kai (楷体), Hei (黑体) and Fangsong (仿宋体) families.
