HZ (character encoding)

The HZ character encoding is an encoding of GB 2312 that was formerly commonly used in email and USENET postings. It was designed in 1989 by Fung Fung Lee of Stanford University, and subsequently codified in 1995 into RFC 1843.

HZ, short for Hanzi, was invented to facilitate the use of Chinese characters in e-mail, which at that time only allowed 7-bit characters. Therefore, in lieu of standard ISO 2022 escape sequences (as in the case of ISO-2022-JP) or 8-bit characters (as in the case of EUC), the HZ code uses only printable, 7-bit characters to represent Chinese characters. It was also popular on USENET, which in the late 1980s and early 1990s generally did not allow transmission of 8-bit characters or escape characters.


History

HZ superseded the earlier "zW" encoding, which marked entire lines as being GB 2312 text by beginning them with the characters zW.


Structure and use

In the HZ encoding system, the character sequences "~{" and "~}" act as escape sequences; anything between them is interpreted as Chinese encoded in GB 2312 (the most significant bits are ignored). Outside the escape sequences, characters are assumed to be ASCII.

GB 2312, EUC-CN, and the HZ code are closely related: EUC-CN represents each GB 2312 character as two bytes with the most significant bit set, and HZ uses the same two bytes with that bit cleared, placed between the escape sequences.

HZ was originally designed to be used purely as a 7-bit code. However, when situations allow, the escape sequences "~{" and "~}" sometimes surround characters represented in EUC-CN; this alternative use allows Chinese to be readable either with the help of HZ decoder software, or with a system that understands EUC-CN.

Additionally, the specification defines that:
* the sequence "~~" is to be treated as encoding a single ASCII "~", and
* the character "~" followed by a newline is to be discarded.
However, not all HZ decoders follow these two rules.
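The rules above can be sketched in Python. This is a minimal illustration, not a reference implementation: it assumes the RFC 1843 delimiters "~{" and "~}", relies on Python's built-in "gb2312" codec for the character table, and the function names hz_encode and hz_decode are invented for this example. Error handling for malformed input is omitted.

```python
ESC_OPEN, ESC_CLOSE = b"~{", b"~}"   # delimiters as given in RFC 1843


def hz_encode(text: str) -> bytes:
    """Encode text as 7-bit HZ (hypothetical helper for illustration)."""
    out = bytearray()
    in_gb = False
    for ch in text:
        if ord(ch) < 0x80:
            if in_gb:                     # leave GB mode before ASCII
                out += ESC_CLOSE
                in_gb = False
            out += b"~~" if ch == "~" else ch.encode("ascii")
        else:
            if not in_gb:                 # enter GB mode before Chinese
                out += ESC_OPEN
                in_gb = True
            euc = ch.encode("gb2312")     # EUC-CN: two bytes, high bit set
            out += bytes(b & 0x7F for b in euc)  # HZ: same bytes, bit cleared
    if in_gb:
        out += ESC_CLOSE
    return bytes(out)


def hz_decode(data: bytes) -> str:
    """Decode HZ, applying the "~~" and "~" + newline rules."""
    out = []
    i = 0
    in_gb = False
    while i < len(data):
        pair = data[i:i + 2]
        if in_gb:
            if pair == ESC_CLOSE:
                in_gb = False
            else:
                # Setting the high bit restores EUC-CN, so pairs that
                # already arrive as EUC-CN are accepted unchanged.
                out.append(bytes(b | 0x80 for b in pair).decode("gb2312"))
            i += 2
        elif pair == ESC_OPEN:
            in_gb = True
            i += 2
        elif pair == b"~~":               # escaped literal tilde
            out.append("~")
            i += 2
        elif pair == b"~\n":              # line continuation: discarded
            i += 2
        else:
            out.append(data[i:i + 1].decode("ascii"))
            i += 1
    return "".join(out)


print("我".encode("gb2312").hex())   # EUC-CN bytes of one character
print(hz_encode("我"))               # the same character in HZ
print(hz_decode(b"Hi ~{Dc:C~}!"))
```

Because the decoder sets the most significant bit before looking up each pair, it handles both the pure 7-bit form and the variant in which the escape sequences surround EUC-CN bytes.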


HZ encoders and decoders

The first HZ encoder and decoder were written in 1989 by the code's inventor for the Unix operating system. Another early program, also for Unix, was among the first and one of the most popular HZ decoders. It deviates from the specification in that it displays the escape sequences (i.e., "~{" and "~}"), and it does not treat "~~" and "~" followed by a newline specially. This was probably to allow software which assumes one character to occupy one screen position (on a text screen) to function correctly without modification. Support on Microsoft Windows came later, and a number of third-party "Chinese systems" support HZ. These systems may provide an option to hide the escape sequences.
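The display behaviour described above, in which the escape sequences stay visible and "~~" is not treated specially, can be sketched as follows. This is an illustration under assumptions, not the code of any actual decoder: the names decode_keep_escapes and screen_columns are hypothetical, the delimiters are taken to be "~{" and "~}", and a Chinese character is assumed to occupy two columns on a text screen.

```python
import re
import unicodedata

# A GB 2312 run: everything between "~{" and "~}" (assumed delimiters).
GB_RUN = re.compile(rb"~\{(.*?)~\}", re.DOTALL)


def decode_keep_escapes(data: bytes) -> str:
    """Decode GB 2312 runs but leave "~{" and "~}" in the output.

    "~~" and "~" followed by a newline are passed through untouched.
    """
    out = []
    pos = 0
    for m in GB_RUN.finditer(data):
        out.append(data[pos:m.start()].decode("ascii"))
        euc = bytes(b | 0x80 for b in m.group(1))   # 7-bit pairs -> EUC-CN
        out.append("~{" + euc.decode("gb2312") + "~}")
        pos = m.end()
    out.append(data[pos:].decode("ascii"))
    return "".join(out)


def screen_columns(text: str) -> int:
    """Columns used on a text screen, counting wide characters as two.

    Characters of ambiguous width are counted as one column here.
    """
    return sum(
        2 if unicodedata.east_asian_width(c) in ("W", "F") else 1
        for c in text
    )


data = b"Hi ~{Dc:C~}!"
shown = decode_keep_escapes(data)
print(shown)
print(len(data), screen_columns(shown))
```

With the escape sequences left on screen, each input byte corresponds to one screen column, which is one plausible reading of why such a decoder would let byte-oriented software keep working unmodified.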


Disadvantages

Because of its escape sequences, and furthermore because its escape delimiters are printable characters in ASCII, it is fairly easy to construct attack byte sequences that round-trip from HZ to Unicode and back. Use of HZ encoding is thus treated as suspicious by malware protection suites.
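The ambiguity behind this can be shown with a short Python sketch. It does not reproduce any particular attack; it only illustrates that the same printable bytes are valid ASCII and valid HZ with different meanings, so a check applied to the raw bytes may not see the text a decoder produces. It assumes Python's built-in "hz" codec and the "~{" and "~}" delimiters.

```python
raw = b"~{Dc:C~}"               # printable, 7-bit bytes only

as_ascii = raw.decode("ascii")  # what a byte-level scanner would see
as_hz = raw.decode("hz")        # what an HZ-aware decoder produces

print(as_ascii)
print(as_hz)

# The ASCII letters visible in the raw bytes are absent after decoding.
print("Dc" in as_ascii, "Dc" in as_hz)

# The decoded text survives a trip through HZ and back unchanged.
print(as_hz.encode("hz").decode("hz") == as_hz)
```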

