Triple-byte Character Set

A double-byte character set (DBCS) is a character encoding in which either all characters (including control characters) are encoded in two bytes, or merely every graphic character not representable by an accompanying single-byte character set (SBCS) is encoded in two bytes (Han characters would generally comprise most of these two-byte characters). A DBCS supports national languages that contain many unique characters or symbols (the maximum number of characters that can be represented with one byte is 256, while two bytes can represent up to 65,536). Examples of such languages include Japanese and Chinese. Hangul does not contain as many characters, but KS X 1001 supports both Hangul and Hanja, and uses two bytes per character.
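The capacity figures above follow directly from the number of bits available. The short Python sketch below reproduces them and, as an illustration only, measures how many bytes a few characters occupy in EUC-KR, which is assumed here as the encoded form of KS X 1001; the helper name `encoded_length` is hypothetical.

```python
# Capacity of one-byte and two-byte code spaces.
ONE_BYTE_CAPACITY = 2 ** 8    # 256 distinct values
TWO_BYTE_CAPACITY = 2 ** 16   # 65,536 distinct values


def encoded_length(char: str, codec: str) -> int:
    """Return how many bytes `char` occupies in the given codec."""
    return len(char.encode(codec))


if __name__ == "__main__":
    print(ONE_BYTE_CAPACITY, TWO_BYTE_CAPACITY)
    # A Hangul syllable, a Hanja character, and an ASCII letter,
    # all encoded with Python's "euc_kr" codec (an assumption of this sketch).
    for ch in ("한", "漢", "A"):
        print(repr(ch), encoded_length(ch, "euc_kr"))
```

In this sketch the Hangul and Hanja characters each take two bytes, while the ASCII letter takes one, which matches the pairing of a DBCS with an accompanying SBCS.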


In CJK computing

The term DBCS traditionally refers to a character encoding where each graphic character is encoded in two bytes. In an 8-bit code, such as Big-5 or Shift JIS, a character from the DBCS is represented with a lead (first) byte that has the most significant bit set (i.e., a value of 0x80 or greater, which does not fit in seven bits), and the DBCS is paired with a single-byte character set (SBCS). For the practical reason of maintaining compatibility with unmodified, off-the-shelf software, the SBCS is associated with half-width characters and the DBCS with full-width characters. In a 7-bit code such as ISO-2022-JP, escape sequences or shift codes are used to switch between the SBCS and DBCS.

Sometimes, the use of the term "DBCS" can imply an underlying structure that does not comply with ISO 2022. For example, "DBCS" can sometimes mean a double-byte encoding that is specifically not Extended Unix Code (EUC). This original meaning of DBCS is different from what some consider correct usage today. Some insist that these character encodings be properly called multi-byte character sets (MBCS) or variable-width encodings, because character encodings such as EUC-JP, EUC-KR, EUC-TW, GB 18030, and UTF-8 use one byte for some characters and two or more bytes (up to four in GB 18030 and UTF-8) for others.
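The lead-byte and escape-sequence mechanisms described in this section can be sketched in Python. The function names `split_dbcs` and `is_seven_bit` are hypothetical, and the splitter makes a simplifying assumption: every byte with the most significant bit set is treated as a lead byte. That holds for Big-5 text, but not for all of Shift JIS, where some bytes with the high bit set are single-byte half-width katakana.

```python
from typing import List


def split_dbcs(data: bytes) -> List[bytes]:
    """Split an 8-bit mixed SBCS/DBCS byte string into characters.

    Simplifying assumption: any byte with the most significant bit set
    is a lead byte, and the byte that follows it is its trail byte.
    """
    chars = []
    i = 0
    while i < len(data):
        width = 2 if data[i] & 0x80 else 1
        chars.append(data[i:i + width])
        i += width
    return chars


def is_seven_bit(data: bytes) -> bool:
    """True if no byte in `data` has its most significant bit set."""
    return all(b < 0x80 for b in data)


if __name__ == "__main__":
    # 8-bit code: lead bytes mark the two-byte characters.
    big5 = "A中B".encode("big5")
    print([chunk.decode("big5") for chunk in split_dbcs(big5)])

    # 7-bit code: escape sequences (starting with ESC, 0x1B) switch
    # between the single-byte and double-byte sets instead.
    jis = "Aあ".encode("iso2022_jp")
    print(jis, is_seven_bit(jis))
```

The splitter has to start at a known character boundary: in Big-5 and Shift JIS a trail byte can fall in the ASCII range, so a byte examined in isolation cannot always be classified.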


Ambiguity

Some people use DBCS to mean the UTF-16 and UTF-8 encodings, while other people use the term DBCS to mean older (pre-Unicode) character encodings that use more than one byte per character. Shift JIS, GB 2312 and Big5 are a few character encodings that can contain more than one byte per character, but even using the term DBCS for these character encodings is incorrect terminology because these character encodings are really variable-width encodings (as are both UTF-16 and UTF-8). Some IBM mainframes do have true DBCS code pages, which contain only the double-byte portion of a multi-byte code page.

If a person uses the term "DBCS enablement" for software internationalization, they are using ambiguous terminology. They either mean they want to write software for East Asian markets using older technology with code pages, or they are planning on using Unicode. Sometimes this term also implies translation into an East Asian language. Usually "Unicode enablement" means internationalizing software by using Unicode, and "DBCS enablement" means using incompatible character encodings that exist between the various countries in East Asia for internationalizing software. Since Unicode, unlike many other character encodings, supports all the major languages in East Asia, it is generally easier to enable and maintain software that uses Unicode. DBCS (non-Unicode) enablement is usually only desired when much older operating systems or applications do not support Unicode.
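Two points from this section lend themselves to a short Python illustration: UTF-8 and UTF-16 are variable-width, and the legacy East Asian encodings are mutually incompatible while Unicode covers all of them. The helper names (`byte_lengths`, `can_encode`, `reinterpret`) and the sample strings are hypothetical choices for this sketch.

```python
from typing import List


def byte_lengths(text: str, codec: str) -> List[int]:
    """Number of bytes each character of `text` takes in `codec`."""
    return [len(ch.encode(codec)) for ch in text]


def can_encode(text: str, codec: str) -> bool:
    """True if every character of `text` exists in `codec`."""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


def reinterpret(text: str, written_as: str, read_as: str) -> str:
    """Encode with one codec and decode the same bytes with another."""
    return text.encode(written_as).decode(read_as, errors="replace")


if __name__ == "__main__":
    sample = "Aé漢😀"
    print(byte_lengths(sample, "utf-8"))      # [1, 2, 3, 4]
    print(byte_lengths(sample, "utf-16-le"))  # [2, 2, 2, 4]

    mixed = "日本語 中文 한국어"
    for codec in ("shift_jis", "big5", "gb2312", "utf-8"):
        print(codec, can_encode(mixed, codec))

    # Bytes written as Shift JIS do not survive being read as Big5.
    print(reinterpret("漢字", "shift_jis", "big5") == "漢字")  # False
```

Of the codecs tried on the mixed-language string, only UTF-8 can encode it, which illustrates why enabling software with Unicode is generally simpler than maintaining a separate code page per market.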


TBCS

A triple-byte character set (TBCS) is a character encoding in which characters (including control characters) are encoded in three bytes.
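As a purely hypothetical illustration of a fixed three-byte layout, the Python sketch below stores every code value in exactly three bytes. It does not reproduce any real TBCS code page; the names `encode_fixed3` and `decode_fixed3`, the big-endian byte order, and the use of Unicode code points as stand-in code values are all assumptions.

```python
from typing import Iterable, List

BYTES_PER_CHAR = 3
CAPACITY = 2 ** (8 * BYTES_PER_CHAR)  # 16,777,216 distinct values


def encode_fixed3(code_values: Iterable[int]) -> bytes:
    """Pack each code value into exactly three big-endian bytes."""
    return b"".join(v.to_bytes(BYTES_PER_CHAR, "big") for v in code_values)


def decode_fixed3(data: bytes) -> List[int]:
    """Unpack a byte string made of three-byte units into code values."""
    if len(data) % BYTES_PER_CHAR:
        raise ValueError("length is not a multiple of three bytes")
    return [
        int.from_bytes(data[i:i + BYTES_PER_CHAR], "big")
        for i in range(0, len(data), BYTES_PER_CHAR)
    ]


if __name__ == "__main__":
    # Unicode code points stand in for code values (an assumption).
    values = [ord(c) for c in "A漢"]
    packed = encode_fixed3(values)
    print(len(packed), decode_fixed3(packed) == values)
```

Because every character, including control characters, takes the same number of bytes, the n-th character always starts at byte offset 3n, unlike in the variable-width encodings discussed above.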


See also

* Variable-width encoding (also known as MBCS, multi-byte character set)
* DOS/V


External links


Microsoft's definition of "double-byte character set"