Big5

Big-5 or Big5 is a Chinese character encoding method used in Taiwan, Hong Kong, and Macau for traditional Chinese characters. The People's Republic of China (PRC), which uses simplified Chinese characters, uses the GB 18030 character set instead. Big5 gets its name from the consortium of five companies in Taiwan that developed it.


Organization

The original Big5 character set is sorted first by usage frequency, second by stroke count, and lastly by Kangxi radical. It lacked many commonly used characters, so each vendor developed its own extension. The ETen extension became part of the current Big5 standard through popularity.

The structure of Big5 does not conform to the ISO 2022 standard, but rather bears a certain similarity to the Shift JIS encoding. It is a double-byte character set (DBCS) with the following structure (the prefix 0x signifying hexadecimal numbers):

First byte ("lead byte"): 0x81 to 0xFE (or 0xA1 to 0xF9 for non-user-defined characters)
Second byte ("trail byte"): 0x40 to 0x7E, 0xA1 to 0xFE

Standard assignments (excluding vendor or user-defined extensions) do not use the bytes 0x7F through 0xA0, nor 0xFF, as either lead (first) or trail (second) bytes. Bytes 0xA1 through 0xFE are used for both lead and trail bytes for double-byte (Big5) codes. Bytes 0x40 through 0x7E are used as trail bytes following a lead byte, or for single-byte codes otherwise. If the second byte is not in either range, behavior is unspecified (i.e., it varies from system to system). Additionally, certain variants of the Big5 character set, for example the HKSCS, use an expanded range for the lead byte, including values in the 0x81 to 0xA0 range (similar to Shift JIS), whereas others use reduced lead byte ranges (for instance, the Apple Macintosh variant uses 0x81 through 0xA0 and 0xFD through 0xFF as single-byte codes, limiting the lead byte range to 0xA1 through 0xFC).

The numerical value of an individual Big5 code is frequently given as a 4-digit hexadecimal number, which describes the two bytes that comprise the Big5 code as if they were a big-endian representation of a 16-bit number. For example, the Big5 code for a full-width space, which is the byte pair 0xA1 0x40, is usually written as 0xA140 or just A140.

Strictly speaking, the Big5 encoding contains only DBCS characters. In practice, however, Big5 codes are always used together with an unspecified, system-dependent single-byte character set (ASCII, or an 8-bit character set such as code page 437), so Big5-encoded text contains a mix of DBCS characters and single-byte characters. Bytes in the range 0x00 to 0x7F that are not part of a double-byte character are assumed to be single-byte characters. (For a more detailed description of this problem, see the discussion under "The Matching SBCS" below.) The meaning of non-ASCII single bytes outside the permitted values that are not part of a double-byte character varies from system to system. In old MS-DOS-based systems, they are likely to be displayed as 8-bit characters; in modern systems, they are likely either to give unpredictable results or to generate an error.
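
The byte-level rules described above can be sketched as a small scanner. This is only an illustration, not a reference decoder: it assumes the commonly documented standard ranges (lead bytes 0xA1–0xFE, trail bytes 0x40–0x7E and 0xA1–0xFE), and the function names `is_lead`, `is_trail` and `split_big5` are hypothetical. It splits a byte string into single-byte and double-byte units and prints each unit in the usual big-endian hexadecimal notation.

```python
def is_lead(b: int) -> bool:
    # Assumed lead-byte range for standard (non-extended) assignments.
    return 0xA1 <= b <= 0xFE


def is_trail(b: int) -> bool:
    # Assumed trail-byte ranges; note that 0x40-0x7E overlaps ASCII.
    return 0x40 <= b <= 0x7E or 0xA1 <= b <= 0xFE


def split_big5(data: bytes) -> list[str]:
    """Split bytes into units, written as 2-digit or 4-digit hex strings."""
    units = []
    i = 0
    while i < len(data):
        b = data[i]
        if is_lead(b) and i + 1 < len(data) and is_trail(data[i + 1]):
            # Double-byte code, written as if it were a big-endian 16-bit number.
            units.append(f"{b:02X}{data[i + 1]:02X}")
            i += 2
        else:
            # Anything else is treated here as a single-byte code.
            units.append(f"{b:02X}")
            i += 1
    return units


# 'A', full-width space (A1 40), 'B'
print(split_big5(b"A\xa1\x40B"))  # ['41', 'A140', '42']
```

The example also shows why a Big5 stream cannot be scanned byte by byte: the trail byte 0x40 of the full-width space is the same value as ASCII "@", so its meaning depends on the byte before it.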


A more detailed look at the organization

In the original Big5, the encoding is compartmentalized into different zones:

0x8140 to 0xA0FE   Reserved for user-defined characters
0xA140 to 0xA3BF   "Graphical characters"
0xA3C0 to 0xA3FE   Reserved, not for user-defined characters
0xA440 to 0xC67E   Frequently used characters
0xC6A1 to 0xC8FE   Reserved for user-defined characters
0xC940 to 0xF9D5   Less frequently used characters
0xF9D6 to 0xFEFE   Reserved for user-defined characters

The "graphical characters" actually comprise punctuation marks, partial punctuation marks (e.g., half of a dash, half of an ellipsis; see below), dingbats, foreign characters, and other special characters (e.g., presentational "full width" forms, digits for Suzhou numerals, zhuyin fuhao, etc.).

In most vendor extensions, extended characters are placed in the various zones reserved for user-defined characters, each of which is normally regarded as associated with the preceding zone. For example, additional "graphical characters" (e.g., punctuation marks) would be expected to be placed in the 0xA3C0–0xA3FE range, and additional logograms would be placed in either the 0xC6A1–0xC8FE or the 0xF9D6–0xFEFE range. Sometimes this is not possible because of the large number of extended characters to be added; for example, Cyrillic letters and Japanese kana have been placed in the zone associated with "frequently-used characters".
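
The zoning can be expressed as a simple lookup. The following sketch assumes the zone boundaries commonly documented for the original Big5 (0x8140–0xA0FE, 0xA140–0xA3BF, 0xA3C0–0xA3FE, 0xA440–0xC67E, 0xC6A1–0xC8FE, 0xC940–0xF9D5, 0xF9D6–0xFEFE); the table `ZONES` and the function `zone_of` are hypothetical names used for illustration only.

```python
# Assumed zone boundaries for the original Big5 (illustrative, not normative).
ZONES = [
    (0x8140, 0xA0FE, "user-defined"),
    (0xA140, 0xA3BF, "graphical characters"),
    (0xA3C0, 0xA3FE, "reserved"),
    (0xA440, 0xC67E, "frequently used characters"),
    (0xC6A1, 0xC8FE, "user-defined"),
    (0xC940, 0xF9D5, "less frequently used characters"),
    (0xF9D6, 0xFEFE, "user-defined"),
]


def zone_of(code: int):
    """Return the zone name for a 16-bit Big5 code, or None if it fits no zone."""
    trail = code & 0xFF
    # Reject codes whose second byte is not a valid trail byte.
    if not (0x40 <= trail <= 0x7E or 0xA1 <= trail <= 0xFE):
        return None
    for first, last, name in ZONES:
        if first <= code <= last:
            return name
    return None


print(zone_of(0xA140))  # graphical characters (the full-width space)
print(zone_of(0xC6A1))  # user-defined (where ETen placed its extensions)
```

A lookup of this kind is mainly useful for diagnostics, for example to flag text that relies on a vendor extension rather than on the original assignments.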


What a Big5 code actually encodes

An individual Big5 code does not always represent a complete semantic unit. The Big5 codes of logograms are always logograms, but codes in the "graphical characters" section are not always complete "graphical characters". What Big5 encodes are particular graphical representations of characters, or parts of characters, that happen to fit in the space taken by two monospaced ASCII characters. This is a property of double-byte character sets as normally used in CJK (Chinese, Japanese, and Korean) computing, and is not a problem unique to Big5.

(The above needs some historical perspective, as it is ''theoretically'' incorrect. When text-mode personal computing was still the norm, characters were normally represented as single bytes and each character took one position on the screen. There was therefore a practical reason to insist that double-byte characters take up two positions on the screen: off-the-shelf, American-made software would then be usable without modification in a DBCS-based system. If a character could take an arbitrary number of screen positions, software that assumes that one ''byte'' of text takes one screen position would produce incorrect output. Of course, if a computer never had to deal with the text screen, the manufacturer would not enforce this artificial restriction; the Apple Macintosh is an example. Nevertheless, the encoding itself must be designed so that it works correctly on text-screen-based systems.)

To illustrate this point, consider the Big5 code 0xA14B (…). To English speakers this looks like an ellipsis, and the Unicode standard identifies it as such; in Chinese, however, the ellipsis consists of six dots that fit in the space of two Chinese characters (……). There is therefore no Big5 code for the Chinese ellipsis, and the Big5 code 0xA14B represents only half of one: the whole ellipsis should take the space of two Chinese characters, and in many DBCS systems one DBCS character must take exactly the space of one Chinese character.

Characters encoded in Big5 do not always represent things that can be readily used in plain text files. One example is the "citation mark" (0xA1CA, ﹋), which, when used, must be typeset under the title of a literary work. Another example is the Suzhou numerals, a form of scientific notation that requires the number to be laid out in a 2-D form consisting of at least two rows.
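
The ellipsis case can be checked with Python's built-in `big5` codec. This is a minimal sketch; it assumes the code in question is 0xA14B, which that codec maps to U+2026 HORIZONTAL ELLIPSIS. A Chinese ellipsis is then two consecutive Big5 codes, four bytes in total.

```python
# One Big5 code, A1 4B: half of a Chinese ellipsis.
half = b"\xa1\x4b".decode("big5")
print(f"U+{ord(half):04X}")  # U+2026

# A full Chinese ellipsis is written as the same code twice.
full = (b"\xa1\x4b" * 2).decode("big5")
print(len(full))  # 2 characters, from 4 bytes
```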


The Matching SBCS

In practice, Big5 cannot be used without a matching single-byte character set (SBCS), mostly for compatibility reasons. However, as with other CJK DBCS character sets, the SBCS to use has never been specified. Big5 has always been defined as a DBCS; in use it must be paired with a suitable, ''unspecified'' SBCS, and the combination is what some people call an MBCS. Nevertheless, Big5 by itself, as defined, is strictly a DBCS.

Because the SBCS is unspecified, it can theoretically vary from system to system. Nowadays, ASCII is the only SBCS one would use. In old DOS-based systems, however, Code Page 437 (with its extra special symbols in the control code area, including position 127) was much more common. On a Macintosh system with the Chinese Language Kit, or on a Unix system running the cxterm terminal emulator, the SBCS paired with Big5 would not be Code Page 437.

Outside the valid range of Big5, the old DOS-based systems would routinely interpret bytes according to the SBCS paired with Big5 on that system. In such systems, characters 127 to 160, for example, were very likely not avoided on the grounds that they would produce invalid Big5, but used because they were valid characters in Code Page 437. The modern characterization of Big5 as an MBCS consisting of the DBCS of Big5 plus the SBCS of ASCII is therefore historically incorrect and potentially flawed, as the choice of the matching SBCS was, and theoretically still is, quite independent of the flavour of Big5 being used.
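
The dependence on the paired SBCS can be made concrete with a small decoder that takes the single-byte codec as a parameter. This is a sketch under stated assumptions: the function name `decode_mixed` is hypothetical, the lead and trail byte ranges are the commonly documented standard ones, and Python's `big5`, `cp437` and `ascii` codecs stand in for the historical system behaviour.

```python
def decode_mixed(data: bytes, sbcs: str = "ascii") -> str:
    """Decode Big5 text, handing every non-Big5 byte to the chosen SBCS codec."""
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        nxt = data[i + 1] if i + 1 < len(data) else None
        is_pair = (
            0xA1 <= b <= 0xFE
            and nxt is not None
            and (0x40 <= nxt <= 0x7E or 0xA1 <= nxt <= 0xFE)
        )
        if is_pair:
            out.append(data[i:i + 2].decode("big5", errors="replace"))
            i += 2
        else:
            # The meaning of this byte depends entirely on the paired SBCS.
            out.append(data[i:i + 1].decode(sbcs, errors="replace"))
            i += 1
    return "".join(out)


sample = b"A\xa4\xa4\x80"  # 'A', the Big5 code A4A4, and a stray byte 0x80
print(decode_mixed(sample, "cp437"))  # byte 0x80 becomes a Code Page 437 letter
print(decode_mixed(sample, "ascii"))  # byte 0x80 becomes U+FFFD
```

The same byte string yields different text depending on the second argument, which is the point made above: the DBCS part is fixed, while the single-byte part is a property of the system.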


History

The inability of ASCII to support large character sets such as those used for Chinese, Japanese and Korean led governments and industry to find creative solutions to enable their languages to be rendered on computers. A variety of ad hoc and usually proprietary input methods led to efforts to develop a standard system. As a result, the Big5 encoding was defined by the Institute for Information Industry of Taiwan in 1984. The name "Big5" recognizes that the standard emerged from the collaboration of five of Taiwan's largest IT firms: Acer (宏碁); MiTAC (神通); JiaJia (佳佳); ZERO ONE Technology (零壹 or 01tech); and First International Computer (FIC) (大眾).

Big5 was rapidly popularized in Taiwan, and worldwide among Chinese who used the traditional Chinese character set, through its adoption in several commercial software packages, notably the E-TEN Chinese DOS input system (ETen Chinese System). The Republic of China government declared Big5 its standard in the mid-1980s since it was, by then, the ''de facto'' standard for using traditional Chinese on computers.


Extensions

The original Big-5 includes only CJK logograms from the Charts of Standard Forms of Common National Characters (4808 characters) and Less-Than-Common National Characters (6343 characters). It does not include characters used for people's names, place names, dialects, chemistry, or biology, nor Japanese kana. As a result, many software packages that support Big-5 include extensions to address these gaps. The plethora of variations makes UTF-8 or UTF-16 a more consistent choice of encoding for modern use.


Vendor extensions


ETen extensions

In the ETen (倚天) Chinese operating system, the following code points are added to support some characters present in the IBM 5550's code page but absent from generic Big5:

* A3C0–A3E0: 33 control characters.
* C6A1–C875: circled numbers 1–10, bracketed numbers 1–10, Roman numerals 1–9 (i–ix), CJK radical glyphs, Japanese hiragana, Japanese katakana, and Cyrillic characters.
* F9D6–F9FE: the characters '碁', '銹', '裏', '墻', '恒', '粧' and '嫺', followed by 34 additional semigraphic symbols.

In some versions of ETen, there are extra graphical symbols and simplified Chinese characters.


Microsoft code pages

Microsoft (微軟) created its own version of the Big5 extension as Code page 950 for use with Microsoft Windows, which supports the F9D6–F9FE code points from ETEN's extensions. In some versions of Windows, the euro currency symbol is mapped to Big-5 code point A3E1. After installing Microsoft's HKSCS patch on top of traditional Chinese Windows (or any version of Windows 2000 and above with the proper language pack), applications using code page 950 automatically use a hidden code page 951 table. The table supports all code points in HKSCS-2001, except for the compatibility code points specified by the standard.
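
The practical effect of the F9D6–F9FE support can be observed with Python's standard codecs, which ship separate `big5`, `cp950` and `big5hkscs` tables. The helper `try_decode` below is a hypothetical name. The sketch assumes that `cp950` maps F9D6 to U+7881; the plain `big5` codec is expected to reject that code, but this may depend on the Python build, so no output is asserted for it.

```python
def try_decode(raw: bytes, codec: str):
    """Return the decoded text, or None if the codec rejects the bytes."""
    try:
        return raw.decode(codec)
    except UnicodeDecodeError:
        return None


raw = b"\xf9\xd6"  # first code point of the ETen range F9D6-F9FE
for codec in ("big5", "cp950", "big5hkscs"):
    print(codec, repr(try_decode(raw, codec)))
```

When text produced on Windows fails to decode as `big5`, retrying with `cp950` is a common first step for exactly this reason.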


IBM code pages

In contrast to Microsoft's code page 950, IBM's CCSID 950 comprises single-byte code page 1114 (CCSID 1114) and double-byte code page 947 (CCSID 947). It incorporates the ETEN extensions with lead bytes 0xA3, 0xC6, 0xC7 and 0xC8, while omitting those with lead byte 0xF9 (which Microsoft includes), mapping them instead to the Private Use Area as user-defined characters. It also includes two non-ETEN extension regions whose trail bytes fall outside the usual Big5 trail byte range but resemble the Big5+ trail byte range: area 5 contains IBM-selected characters, while area 9 is a user-defined region.

IBM refers to the euro sign update of its Big-5 variant as CCSID 1370, which includes both single-byte and double-byte euro signs. It comprises single-byte code page 1114 (CCSID 5210) and double-byte code page 947 (CCSID 21427). For better compatibility with Microsoft's variant in IBM Db2, IBM also defines the pure double-byte Code page 1372 and the associated variable-width CCSID 1373, which corresponds to Microsoft's code page 950.

IBM assigns CCSID 5471 to the HKSCS-2001 Big5 code page (with CPGID 1374, as CCSID 5470, as the double-byte component), CCSID 9567 to the HKSCS-2004 code page (with CPGID 1374, as CCSID 9566, as the double-byte component), and CCSID 13663 to the HKSCS-2008 code page (with CPGID 1374, as CCSID 13662, as the double-byte component), while CCSID 1375 is assigned to a growing HKSCS code page, currently equivalent to CCSID 13663.


ChinaSea font

ChinaSea fonts (中國海字集) are Traditional Chinese fonts made by ChinaSea. The fonts are rarely sold separately, but are bundled with other products, such as the Chinese version of Microsoft Office 97. The fonts support Japanese kana, kokuji, and other characters missing in Big-5. As a result, the ChinaSea extensions have become more popular than the government-supported extensions. Some Hong Kong BBSes had used encodings in ChinaSea fonts before the introduction of HKSCS.


'Sakura' font

The 'Sakura' font (日和字集 Sakura Version) was developed in Hong Kong and is designed to be compatible with HKSCS. It adds support for kokuji and proprietary dingbats (including Doraemon) not found in HKSCS.


Unicode-at-on

Unicode-at-on (Unicode補完計畫), formerly BIG5 extension, extends BIG-5 by altering code page tables, but uses the ChinaSea extensions starting with version 2. However, with the bankruptcy of ChinaSea, late development, and the increasing popularity of HKSCS and Unicode (the project is not compatible with HKSCS), the success of this extension is limited at best. Despite the problems, characters previously mapped to the Unicode Private Use Area are remapped to their standardized equivalents when exporting characters to Unicode format.


OPG

The web sites of the Oriental Daily News and Sun Daily, belonging to the Oriental Press Group Limited (東方報業集團有限公司) in Hong Kong, used a downloadable font with a Big-5 extension coding different from that of HKSCS.


Official extensions


Taiwan Ministry of Education font

The Taiwan Ministry of Education supplied its own font, the Taiwan Ministry of Education font (臺灣教育部造字檔) for use internally.


Taiwan Council of Agriculture font

Taiwan's Council of Agriculture, Executive Yuan, introduced a 133-character custom font, the Taiwan Council of Agriculture font (臺灣農委會常用中文外字集), which includes 84 characters from the fish radical and 7 from the bird radical.


Big5+

The Chinese Foundation for Digitization Technology (中文數位化技術推廣委員會) introduced Big5+ in 1997, which used over 20000 code points to incorporate all CJK logograms in Unicode 1.1. However, the extra code points exceeded the original Big-5 definition (Big5+ uses high byte values 81-FE and low byte values 40-7E and 80-FE), preventing it from being installed on Microsoft Windows without new codepage files.
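
The size difference between the two byte layouts can be worked out directly from the ranges given above. The sketch below takes the Big5+ ranges from the text (high bytes 81–FE, low bytes 40–7E and 80–FE) and assumes that the original definition allows low bytes 40–7E and A1–FE with the same high-byte range; the function names are hypothetical.

```python
def in_original_big5(lead: int, trail: int) -> bool:
    # Assumed original definition: trail bytes 40-7E and A1-FE.
    return 0x81 <= lead <= 0xFE and (0x40 <= trail <= 0x7E or 0xA1 <= trail <= 0xFE)


def in_big5_plus(lead: int, trail: int) -> bool:
    # Big5+ as described above: trail bytes 40-7E and 80-FE.
    return 0x81 <= lead <= 0xFE and (0x40 <= trail <= 0x7E or 0x80 <= trail <= 0xFE)


def count(predicate) -> int:
    """Count all (lead, trail) byte pairs accepted by the predicate."""
    return sum(predicate(lead, trail) for lead in range(256) for trail in range(256))


original_total = count(in_original_big5)  # 126 lead bytes x 157 trail bytes
plus_total = count(in_big5_plus)          # 126 lead bytes x 190 trail bytes
print(original_total, plus_total)         # 19782 23940
```

The 33 extra trail byte values (80–A0) per lead byte are what push Big5+ past 20000 code points, and they are also why software that validates trail bytes against the original definition cannot accept Big5+ data.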


Big-5E

To allow Windows users to use custom fonts, the Chinese Foundation for Digitization Technology introduced Big-5E, which added 3954 characters (in three blocks of code points: 8E40-A0FE, 8140-86DF, 86E0-875C) and removed the Japanese kana from the ETEN extension. Unlike Big5+, Big-5E extends Big-5 within its original definition. Mac OS X 10.3 and later support Big-5E in the fonts LiHei Pro (儷黑 Pro.ttf) and LiSong Pro (儷宋 Pro.ttf).
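
The figure of 3954 characters can be cross-checked against the three blocks. The sketch below counts the code points in each block whose second byte is a valid trail byte, assuming the original definition's trail ranges (40–7E and A1–FE); the names `valid_trail`, `count_codes` and `BLOCKS` are hypothetical.

```python
def valid_trail(b: int) -> bool:
    # Assumed trail-byte ranges of the original Big5 definition.
    return 0x40 <= b <= 0x7E or 0xA1 <= b <= 0xFE


def count_codes(first: int, last: int) -> int:
    """Count 16-bit codes in [first, last] that end in a valid trail byte."""
    return sum(1 for code in range(first, last + 1) if valid_trail(code & 0xFF))


BLOCKS = [(0x8E40, 0xA0FE), (0x8140, 0x86DF), (0x86E0, 0x875C)]
sizes = [count_codes(first, last) for first, last in BLOCKS]
print(sizes, sum(sizes))  # [2983, 911, 60] 3954
```

Under these assumptions the three blocks are completely filled: every valid code position in them corresponds to one of the added characters.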


Big5-2003

The Chinese Foundation for Digitization Technology made a Big5 definition and put it into CNS 11643 in note form, making it part of the official standard in Taiwan. Big5-2003 incorporates all Big-5 characters introduced in the 1984 ETEN extensions (code points A3C0-A3E0, C6A1-C7F2, and F9D6-F9FE) and the Euro symbol. Cyrillic characters were not included because the authority claimed CNS 11643 does not include such characters.


CDP

Academia Sinica made a Chinese Data Processing font (漢字構形資料庫) in the late 1990s; its latest release, version 2.5, included 112,533 characters, somewhat fewer than the Mojikyo fonts.


HKSCS

Hong Kong also adopted Big5 for character encoding. However, written Cantonese has its own characters that are not available in the normal Big5 character set. To solve this problem, the Hong Kong Government created the Big5 extensions Government Chinese Character Set (GCCS) in 1995 and Hong Kong Supplementary Character Set (HKSCS) in 1999. The Hong Kong extensions were commonly distributed as a patch. They are still distributed as a patch by Microsoft, but a full Unicode font is also available from the Hong Kong Government's web site.

There are two encoding schemes of HKSCS: one for the Big-5 coding standard and the other for the ISO 10646 standard. Subsequent to the initial release, there are also HKSCS-2001 and HKSCS-2004. HKSCS-2004 is aligned technically with ISO/IEC 10646:2003 and its Amendment 1, published in April 2004 by the International Organization for Standardization (ISO). HKSCS includes all the characters from the common ETEN extension, plus some characters from simplified Chinese, place names, people's names, and Cantonese phrases (including profanity).

The most recent edition of HKSCS is HKSCS-2016; however, the last edition to encode all of its characters in Big5 was HKSCS-2008, while the characters added in more recent editions are mapped to ISO 10646 / Unicode only (as a CJK Unified Ideographs horizontal glyph extension where appropriate).

Similarly to Hong Kong's situation, Macao needs characters that are included in neither Big5 nor HKSCS; hence the ''Macao Supplementary Character Set'' (MSCS) was developed, comprising characters not found in Big5 or HKSCS. It, however, is also not encoded in Big5. The first batch of 121 MSCS characters was submitted for inclusion in, or mapping to, Unicode in 2009, and the first final version of MSCS was established in 2020.


Kana and Cyrillic

There are two major Big5 extension layouts for encoding
kana
, Russian Cyrillic and list markers in the range 0xC6A1 through 0xC875. These are not compatible with one another. They are compared in the table below. The ETEN layout of kana and Cyrillic is also used by the
HKSCS
(including
HTML5
) and Unicode-At-On variants, as well as by IBM's version of code page 950, and the ETEN layout of the kana (with Cyrillic omitted) is also used by the Big5-2003 variant. The published mapping files for
Windows-950
include neither, and this Big5 range is mapped to the
Private Use Area
by the Windows-950 implementation from
International Components for Unicode
. Python's built-in codec implementation uses the BIG5.TXT layout (per a script showing the output of the cp950 codec for lead bytes 0xC6 and 0xC7). The classic Mac OS version includes neither layout.
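
The disagreement between implementations over this range can be observed directly. The sketch below is illustrative only: it assumes the usual Big5 trail-byte ranges (0x40 to 0x7E and 0xA1 to 0xFE) and Python's standard `big5`, `cp950` and `big5hkscs` codecs, and the function names are hypothetical. No particular decoding result is claimed here; the point is to compare what each codec returns for the same bytes.

```python
RANGE_START = 0xC6A1  # first code of the kana/Cyrillic/list-marker range
RANGE_END = 0xC875    # last code of that range


def is_valid_trail(trail: int) -> bool:
    """Big5 trail bytes are assumed to lie in 0x40-0x7E or 0xA1-0xFE."""
    return 0x40 <= trail <= 0x7E or 0xA1 <= trail <= 0xFE


def in_extension_range(lead: int, trail: int) -> bool:
    """True if the two-byte code falls in 0xC6A1 through 0xC875."""
    code = (lead << 8) | trail
    return is_valid_trail(trail) and RANGE_START <= code <= RANGE_END


def iter_extension_codes():
    """Yield every (lead, trail) pair inside the range."""
    for lead in (0xC6, 0xC7, 0xC8):
        for trail in range(0x100):
            if in_extension_range(lead, trail):
                yield lead, trail


def decode_with(codec: str, lead: int, trail: int):
    """Decode one two-byte code; return None if the codec rejects it."""
    try:
        return bytes([lead, trail]).decode(codec)
    except UnicodeDecodeError:
        return None


def compare_codecs(lead: int, trail: int,
                   codecs=("big5", "cp950", "big5hkscs")) -> dict:
    """Map each codec name to its decoding of the same two bytes."""
    return {name: decode_with(name, lead, trail) for name in codecs}


if __name__ == "__main__":
    # Print only the codes on which the codecs disagree.
    for lead, trail in iter_extension_codes():
        results = compare_codecs(lead, trail)
        if len(set(results.values())) > 1:
            print(f"0x{lead:02X}{trail:02X}", results)
```

Running the script lists each code in the range for which the three codecs give different answers, which is a quick way to see where two layouts diverge without relying on a published table.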


See also

*
Unicode
*
Han unification
*
Chinese input methods for computers



External links


Mozilla and the Big5 Family of Encodings
an overview of Big5 encodings with code charts for each extension and relevant Firefox bugs (Traditional Chinese)

by Christian Wittern
CNS 11643 official web site
has information about the Big5e character set (an extended version of Big5) in the "Chinese Information Code" section.
Big5 introduction
Contains differences between extensions.
Graphical View of Big5 in ICU's Converter Explorer
教育部標準字體 (Ministry of Education standard typefaces)
Download page of the Taiwan Ministry of Education fonts
文獻處理實驗室 (Document Processing Laboratory)
Download pages of the CDP font
Hong Kong Supplementary Character Set Info
Downloadable HKSCS documents & font
香港參考宋體 (Hong Kong Reference Song typeface)
Download page of the HKSCS font by Dynalab (華康科技有限公司).
Microsoft's Windows Codepage 950
(Traditional Chinese Big5)

Download page of the OPG font

Download page of the ChinaSea font
Big5 Codeset Overview