GBK is an extension of the GB 2312 character set for Simplified Chinese characters, used in the People's Republic of China. It includes all unified CJK characters found in GB 13000.1-93, i.e. ISO/IEC 10646:1993, or Unicode 1.1. Since its initial release in 1993, GBK has been extended by Microsoft in Code page 936/1386, which was then extended into GBK 1.0. GBK is also the IANA-registered internet name for the Microsoft mapping, which differs from other implementations primarily by the single-byte euro sign at 0x80.

"GB" abbreviates Guójiā Biāozhǔn (国家标准), which means "national standard" in Chinese, while "K" stands for "Extension" (扩展, kuòzhǎn). GBK extended the old standard not only with Traditional Chinese characters, but also with Chinese characters that were simplified after the establishment of GB 2312 in 1981. With the arrival of GBK, certain names containing formerly unrepresentable characters, such as the 镕 (róng) in former Chinese Premier Zhu Rongji's name, became representable.

GBK is the third-most popular encoding served from China and territories (after UTF-8 and the subset GB 2312), with 1.9% of web servers serving a page that declares GBK. However, all major web browsers decode GB2312-marked documents as if they were marked GBK, except for Safari and Edge on the label GB_2312. Together, GBK and GB2312 encodings have a combined 5.5% presence in China and territories. Globally, GBK accounts for less than 0.07% of all web pages and GBK+GB2312 for 0.2%.
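The relationship between the two character sets can be checked with a short Python sketch. It assumes the `gb2312` and `gbk` codecs bundled with CPython, whose mapping tables may differ in small details from other implementations; the helper name `can_encode` is hypothetical and exists only for this illustration.

```python
def can_encode(text: str, codec: str) -> bool:
    """Return True if every character of `text` is representable in `codec`."""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


# The character 镕 from Zhu Rongji's name is outside GB 2312 but inside GBK.
name = "朱镕基"
print("gb2312:", can_encode(name, "gb2312"))
print("gbk:   ", can_encode(name, "gbk"))

# For hanzi that GB 2312 already covers, both codecs produce the same bytes,
# which is what backward compatibility means in practice.
sample = "简体中文"
print(sample.encode("gb2312") == sample.encode("gbk"))
```

This only probes individual strings; it is not a full comparison of the two code tables.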


History

In 1993, the Unicode 1.1 standard was released, including 20,902 characters used in mainland China, Taiwan, Japan and Korea. Following this, China released GB 13000.1-93, the Guobiao standard equivalent of Unicode 1.1.

The GBK character set was defined in 1993 as an extension of GB 2312-80, while also including the characters of GB 13000.1-93 through the unused codepoints available in GB 2312. Hence GBK is backward compatible with GB 2312. GBK was defined in a normative annex to GB 13000.1-93.

Microsoft implemented GBK in Windows 95 and Windows NT 3.51 as Code Page 936. While GBK was never an official standard, widespread usage of Windows 95 led to GBK becoming the de facto standard. Although GBK included all the Chinese characters defined in Unicode 1.1 and GB 13000.1-93, these standards used different code tables. The primary reason for GBK's existence was simply to bridge the gap between GB 2312-80 and GB 13000.1-93.

In 1995, the China National Information Technology Standardization Technical Committee set down the Chinese Internal Code Extension Specification (汉字内码扩展规范 (GBK), Hànzì Nèimǎ Kuòzhǎn Guīfàn (GBK)), Version 1.0, known as GBK 1.0, which is a slight extension of Code Page 936. The 95 newly added characters were not found in GB 13000.1-1993 and were provisionally assigned Unicode PUA code points.

Microsoft later added the euro sign to Code Page 936 and assigned it the code 0x80. This is not a valid code point in GBK 1.0.

In 2000, the GB 18030-2000 standard was released, superseding GBK 1.0 while maintaining compatibility with it. It increased the number of defined Chinese characters and extended the number of possible characters by introducing four-byte sequences. The subset of GB 18030 consisting of one-byte and two-byte characters is sometimes also referred to as GBK. The mapping to Unicode changed slightly, though, as some characters are now defined in Unicode. In the most up-to-date form of the standard, GB 18030-2005, only 24 characters are still mapped to the Unicode PUA (see GB 18030, section PUA).

In 2002, GBK was registered as an IANA charset; the registration uses the Code Page 936 mapping as well as the CP936/MS936 aliases, but refers to the GBK 1.0 specification.

W3C's technical recommendation published in 2015 defines a GBK encoder as a GB 18030 encoder with a single-byte euro sign and without four-byte sequences (while W3C's GBK decoder specification has no such limitation and decodes as GB 18030, i.e. with the same range of letters as all of Unicode).
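The encoder rule in the last paragraph (GB 18030, plus a single-byte euro sign, minus four-byte sequences) can be approximated in a few lines of Python. This is only a sketch under stated assumptions: it leans on CPython's built-in `gb18030` codec, whose mapping table is not guaranteed to match the index used by the W3C recommendation, and the function name `gbk_encode_web_style` is hypothetical.

```python
def gbk_encode_web_style(text: str) -> bytes:
    """Approximate a 'GB 18030 encoder with single-byte euro, no four-byte output'.

    Assumption: CPython's gb18030 codec stands in for the real mapping index.
    """
    out = bytearray()
    for index, char in enumerate(text):
        if char == "\u20ac":
            # Euro sign: emitted as the single byte 0x80 (Code Page 936 style).
            out.append(0x80)
            continue
        encoded = char.encode("gb18030")
        if len(encoded) == 4:
            # Four-byte sequences belong to GB 18030 only, not to GBK.
            raise UnicodeEncodeError(
                "gbk", text, index, index + 1,
                "character needs a four-byte GB 18030 sequence",
            )
        out += encoded
    return bytes(out)


print(gbk_encode_web_style("A中€").hex())
```

A matching decoder would, per the same paragraph, simply decode as GB 18030 while also accepting 0x80 as the euro sign; that half is omitted here to keep the sketch short.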


Encoding

A character is encoded as 1 or 2 bytes. A byte in the range 00–7F is a single byte that means the same thing as it does in ASCII. Strictly speaking, there are 95 characters and 33 control codes in this range.

A byte with the high bit set indicates that it is the first of 2 bytes. Loosely speaking, the first byte is in the range 81–FE (that is, never 80 or FF), and the second byte is 40–A0 except 7F for some areas and A1–FE for others. More specifically, the following ranges of bytes are defined:
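As a rough illustration of the loose rule described above (not of the finer per-region ranges), the following Python sketch splits a byte string into one-byte and two-byte units. It assumes plain GBK without the Code Page 936 euro byte at 0x80, and it checks byte ranges only, not whether a pair is actually assigned a character; the name `split_gbk` is hypothetical.

```python
from typing import List


def split_gbk(data: bytes) -> List[bytes]:
    """Split `data` into 1-byte (ASCII) and 2-byte (GBK) code units."""
    units = []
    i = 0
    while i < len(data):
        lead = data[i]
        if lead <= 0x7F:
            # Single byte with the same meaning as in ASCII.
            units.append(data[i:i + 1])
            i += 1
        elif 0x81 <= lead <= 0xFE:
            # High bit set: first byte of a pair; trail byte is 40-FE except 7F.
            if i + 1 >= len(data):
                raise ValueError(f"truncated two-byte sequence at offset {i}")
            trail = data[i + 1]
            if not (0x40 <= trail <= 0xFE) or trail == 0x7F:
                raise ValueError(f"invalid trail byte 0x{trail:02X} at offset {i + 1}")
            units.append(data[i:i + 2])
            i += 2
        else:
            # 0x80 and 0xFF are never lead bytes in GBK 1.0.
            raise ValueError(f"invalid lead byte 0x{lead:02X} at offset {i}")
    return units


print(split_gbk("a中b".encode("gbk")))
```

Because a trail byte may fall in the ASCII range (40–7E), a scanner like this has to walk the data from the start of a character boundary; it cannot resynchronize from an arbitrary offset.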


Layout diagram

In graphical form, the following figure shows the space of all 64K possible 2-byte codes. Green and yellow areas are assigned GBK codepoints, red are for user-defined characters. The uncolored areas are invalid byte combinations.


Relationship to other encodings

The areas indicated in the previous section as GBK/1 and GBK/2, taken by themselves, are simply GB 2312-80 in its usual encoding, GBK/1 being the non-hanzi region and GBK/2 the hanzi region. GB 2312, or more properly the EUC-CN encoding thereof, takes a pair of bytes from the range A1–FE, like any 94² ISO-2022 character set loaded into GR. This corresponds to the lower-right quarter of the illustration above. However, GB 2312 does not assign any code points to the rows located at AA–AF and F8–FE, even though it had staked out the territory. GBK added extensions to these rows. You can see that the two gaps were filled in with user-defined areas.

More significantly, GBK extended the range of the bytes. Having two-byte characters in the ISO-2022 GR range gives a limit of 94² = 8,836 possibilities. Abandoning the ISO-2022 model of strict regions for graphics and control characters, but retaining the feature of low bytes being 1-byte characters and pairs of high bytes denoting a character, you could potentially have 128² = 16,384 positions. GBK takes part of that, extending the range from A1–FE (94 choices for each byte) to 81–FE (126 choices) for the first byte and 40–FE (191 choices) for the second byte, for a total of 24,066 positions.

Microsoft's Code Page 936 is generally thought of as being GBK. However, the 95 PUA characters added in GBK 1.0 are not included in Code Page 936. Code Page 936 also has a single-byte euro sign at 0x80, which GBK 1.0 does not have.

GBK's successor, GB 18030, uses the remaining range available to the second byte (30–39) to further expand the number of possibilities while retaining GBK as a subset.
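The position counts quoted in this section follow directly from the byte ranges, as the short Python calculation below shows. The variable names are illustrative only, and the final line is an added observation rather than a statement from the standard: excluding 7F as a second byte leaves 190 choices, which gives the 23,940 figure cited elsewhere for the GBK code table.

```python
# EUC-CN / ISO-2022 GR model: both bytes in A1-FE.
gr_choices = 0xFE - 0xA1 + 1           # 94
gb2312_positions = gr_choices ** 2     # 8,836

# Any pair of high-bit bytes: 128 choices each.
high_pair_positions = 128 ** 2         # 16,384

# GBK: first byte 81-FE, second byte 40-FE.
lead_choices = 0xFE - 0x81 + 1         # 126
trail_choices = 0xFE - 0x40 + 1        # 191 (still counting 7F)
gbk_positions = lead_choices * trail_choices

# Excluding the unused trail byte 7F.
gbk_positions_without_7f = lead_choices * (trail_choices - 1)

print(gb2312_positions, high_pair_positions, gbk_positions, gbk_positions_without_7f)
```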



External links


A scan of the GBK 1.0 specification, provided by the Ideographic Research Group

ICU's Authoritative GBK mapping
- part o


Microsoft Reference page for GBK Mapping of GBK to Unicode
N.B.: this is Microsoft code page 936, which contains entries for 21791 double-byte code points, 96 single-byte graphic characters, and 33 control characters. This is not exactly the same as GBK which has 21886 characters.
GBK Code Table
N.B.: this GBK-encoded page shows the available coding space totally populated except for 2 places, for a total of 32256 glyphs (32352 with the implied single-byte ASCII codes not illustrated), which is more than 23940 or 21886. Actual rendering of this table depends on your browser's GBK decoder.