Japanese language and computers

Adapting computers to the Japanese language raises a number of issues, some unique to Japanese and others common to languages that have a very large number of characters. The number of characters needed to write English is small, so a single byte (2^8 = 256 possible values) is enough to encode each English character. Japanese, however, has far more than 256 characters and cannot be encoded using a single byte per character; it is instead encoded using two or more bytes, in a so-called "double-byte" or "multi-byte" encoding. The problems that arise relate to transliteration and romanization, character encoding, and input of Japanese text.
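
The arithmetic behind the single-byte limit can be checked directly. The following is a minimal Python sketch, using Python's built-in codecs and sample characters chosen here for illustration (they do not come from the text above), that compares how many bytes one character occupies in several encodings.

```python
# One byte has 8 bits, so it can distinguish 2**8 = 256 values.
VALUES_PER_BYTE = 2 ** 8


def encoded_length(char: str, encoding: str) -> int:
    """Return the number of bytes `char` occupies in `encoding`."""
    return len(char.encode(encoding))


# Sample characters (illustrative choices): a Latin letter and a kanji.
for char in ("A", "日"):
    for encoding in ("shift_jis", "euc_jp", "utf-8"):
        print(char, encoding, encoded_length(char, encoding))
```

The Latin letter fits in one byte in every encoding listed, while the kanji needs two bytes in Shift JIS and EUC-JP and three in UTF-8.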


Character encodings

There are several standard methods to encode Japanese characters for use on a computer, including JIS, Shift-JIS, EUC, and Unicode. While mapping the set of kana is a simple matter, kanji has proven more difficult. Despite efforts, none of the encoding schemes became the de facto standard, and multiple encoding standards were in use by the 2000s. As of 2017, the share of UTF-8 traffic on the Internet had expanded to over 90% worldwide, and only 1.2% used Shift-JIS or EUC. Yet a few popular websites, including 2channel and kakaku.com, are still using Shift-JIS.

Until the 2000s, most Japanese emails were in ISO-2022-JP ("JIS encoding") and web pages in Shift-JIS, and mobile phones in Japan usually used some form of Extended Unix Code. If a program fails to determine the encoding scheme employed, it can cause mojibake (garbled characters) and thus unreadable text on computers.

The first encoding to become widely used was JIS X 0201, which is a single-byte encoding that only covers standard 7-bit
ASCII characters with half-width katakana extensions. This was widely used in systems that had neither the processing power nor the storage to handle kanji (including old embedded equipment such as cash registers), because kana-kanji conversion required a complicated process, and output in kanji required much memory and high resolution. As a result, only katakana, not kanji, was supported by this technique. Some embedded displays still have this limitation.

The development of kanji encodings was the beginning of the split. Shift JIS supports kanji and was developed to be completely backward compatible with JIS X 0201, and thus is found in much embedded electronic equipment. However, Shift JIS has the unfortunate property that it often breaks any parser (software that reads the coded text) that is not specifically designed to handle it. For example, some Shift-JIS characters include a backslash (0x5C "\") in the second byte, which is used as an escape character in many programming languages. A parser lacking support for Shift JIS will recognize a sequence such as 0x5C 0x82 as an invalid escape sequence and remove it, so the affected text turns into mojibake. This can happen, for example, in the C programming language when Shift-JIS appears in text strings. It does not happen in HTML, since ASCII 0x00–0x3F (which includes ", %, & and some other commonly used escape characters and string separators) does not appear as a second byte in Shift-JIS, and backslash is not an escape character there. But it can happen for
JavaScript, which can be embedded in HTML pages.

EUC, on the other hand, is handled much better by parsers that were written for 7-bit ASCII (and thus EUC encodings are used on UNIX, where much of the file-handling code was historically written only for English encodings). But EUC is not backward compatible with JIS X 0201, the first main Japanese encoding.

Further complications arise because the original Internet e-mail standards only support 7-bit transfer protocols. Thus ISO-2022-JP (often simply called "JIS encoding") was developed for sending and receiving e-mails.

In character set standards such as JIS, not all required characters are included, so gaiji ("external characters") are sometimes used to supplement the character set. Gaiji may come in the form of external font packs, where normal characters have been replaced with new characters, or the new characters have been added to unused character positions. However, gaiji are not practical in Internet environments, since the font set must be transferred with the text to use the gaiji. As a result, such characters are written with similar or simpler characters in their place, or the text may need to be encoded using a larger character set (such as Unicode) that supports the required character.
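
The byte-level properties described in this section (a 0x5C second byte in Shift JIS, high-bit bytes in EUC, 7-bit output in ISO-2022-JP) can be observed with Python's built-in codecs. The sketch below is illustrative only: the sample string and the `naive_unescape` helper are hypothetical, and the helper merely imitates a byte-oriented parser that drops unrecognized backslash sequences; it is not the behavior of any particular compiler or library.

```python
def naive_unescape(raw: bytes) -> bytes:
    """Byte-oriented unescaping that knows nothing about Shift JIS.

    Recognized escapes are translated; an unrecognized backslash
    sequence is dropped entirely (hypothetical parser behavior).
    """
    known = {ord("n"): b"\n", ord("t"): b"\t", 0x5C: b"\\"}
    out = bytearray()
    i = 0
    while i < len(raw):
        if raw[i] == 0x5C and i + 1 < len(raw):
            out += known.get(raw[i + 1], b"")  # unknown escape: drop both bytes
            i += 2
        else:
            out.append(raw[i])
            i += 1
    return bytes(out)


text = "表示"  # illustrative sample; 表 is 0x95 0x5C in Shift JIS

sjis = text.encode("shift_jis")
euc = text.encode("euc_jp")
jis = text.encode("iso2022_jp")

# Shift JIS: the second byte of the first character is a backslash.
print(sjis.hex())
damaged = naive_unescape(sjis)
print(damaged.decode("shift_jis", errors="replace"))  # no longer the original

# EUC-JP: both bytes of each kanji have the high bit set, so an
# ASCII-oriented parser never mistakes them for a backslash.
print(all(b >= 0x80 for b in euc), naive_unescape(euc) == euc)

# ISO-2022-JP: escape sequences switch character sets; every byte is 7-bit.
print(all(b < 0x80 for b in jis))

# Half-width katakana keeps its single-byte JIS X 0201 value in Shift JIS.
print("ｶ".encode("shift_jis"))
```

The same helper leaves the EUC-JP bytes untouched, which is the sense in which EUC is "handled much better" by code written for 7-bit ASCII.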
Unicode was intended to solve all encoding problems over all languages. The UTF-8 encoding used to encode Unicode in web pages does not have the disadvantages that Shift-JIS has. Unicode is supported by international software, and it eliminates the need for gaiji. There are still controversies, however. For Japanese, the kanji characters have been unified with Chinese; that is, a character considered to be the same in both Japanese and Chinese is given a single number, even if the appearance is actually somewhat different, with the precise appearance left to the use of a locale-appropriate font. This process, called Han unification, has caused controversy. The previous encodings in Japan, the Taiwan Area, Mainland China and Korea each handled only one language, whereas Unicode is meant to handle all of them. The handling of kanji/Chinese characters was, however, designed by a committee composed of representatives from all four countries/areas.
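
Two of the points above can be illustrated with a short Python sketch. The sample characters are choices made here for illustration: two characters whose Shift JIS form contains 0x5C, and one kanji that is often cited as looking different in Japanese and Chinese fonts.

```python
import unicodedata

sample = "ソ表"  # characters whose Shift JIS form contains the byte 0x5C

utf8 = sample.encode("utf-8")

# In UTF-8 every byte of a non-ASCII character is 0x80 or above,
# so the backslash byte 0x5C cannot occur inside one.
print(0x5C in sample.encode("shift_jis"))  # True
print(0x5C in utf8)                        # False

# Han unification: the character has a single code point; nothing in the
# encoded text says whether a Japanese or a Chinese glyph is wanted.
kanji = "直"
print(f"U+{ord(kanji):04X}", unicodedata.name(kanji))
```

Because the code point carries no language information, the choice of glyph has to come from outside the text, for example from the font or from language metadata supplied by the surrounding document.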


Text input

Written Japanese uses several different scripts: kanji (Chinese characters), two sets of kana (phonetic syllabaries) and roman letters. While kana and roman letters can be typed directly into a computer, entering kanji is a more complicated process, as there are far more kanji than there are keys on most keyboards. To input kanji on modern computers, the reading of the kanji is usually entered first; then an input method editor (IME), also sometimes known as a front-end processor, shows a list of candidate kanji that are a phonetic match and allows the user to choose the correct kanji. More advanced IMEs work not by word but by phrase, thus increasing the likelihood of getting the desired characters as the first option presented.

Kanji readings can be input either via romanization (rōmaji nyūryoku) or by direct kana input (kana nyūryoku). Romaji input is more common on PCs and other full-size keyboards (although direct input is also widely supported), whereas direct kana input is typically used on mobile phones and similar devices: each of the 10 digits (1–9, 0) corresponds to one of the 10 columns in the gojūon table of kana, and multiple presses select the row.

There are two main systems for the romanization of Japanese, known as Kunrei-shiki and Hepburn; in practice, "keyboard romaji" (also known as wāpuro rōmaji or "word processor romaji") generally allows a loose combination of both. IME implementations may even handle keys for letters unused in any romanization scheme, such as L, converting them to the most appropriate equivalent. With kana input, each key on the keyboard directly corresponds to one kana. The JIS keyboard system is the national standard, but there are alternatives, like the thumb-shift keyboard, commonly used among professional typists.
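
The two input styles described above can be sketched in a few lines of Python. This is a toy illustration, not how any real IME is implemented: the `ROMAJI` table is deliberately tiny, the `KEYPAD` layout is the conventional phone assignment assumed here, and the function names are hypothetical. Kanji candidate selection is omitted entirely.

```python
# Partial romaji table accepting both Kunrei-shiki and Hepburn spellings.
ROMAJI = {
    "a": "あ", "i": "い", "u": "う", "e": "え", "o": "お",
    "ka": "か", "ki": "き", "ku": "く", "ke": "け", "ko": "こ",
    "sa": "さ", "si": "し", "shi": "し", "su": "す", "se": "せ", "so": "そ",
    "ta": "た", "ti": "ち", "chi": "ち", "tu": "つ", "tsu": "つ",
    "te": "て", "to": "と",
    "ha": "は", "hi": "ひ", "hu": "ふ", "fu": "ふ", "he": "へ", "ho": "ほ",
    "zi": "じ", "ji": "じ",
}

# Assumed keypad layout: one digit per gojūon column.
KEYPAD = {
    "1": "あいうえお", "2": "かきくけこ", "3": "さしすせそ",
    "4": "たちつてと", "5": "なにぬねの", "6": "はひふへほ",
    "7": "まみむめも", "8": "やゆよ", "9": "らりるれろ", "0": "わをん",
}


def romaji_to_kana(text: str) -> str:
    """Greedy longest-match conversion; unknown letters pass through."""
    out = []
    i = 0
    while i < len(text):
        for size in (3, 2, 1):
            chunk = text[i:i + size]
            if chunk in ROMAJI:
                out.append(ROMAJI[chunk])
                i += len(chunk)
                break
        else:
            out.append(text[i])
            i += 1
    return "".join(out)


def multitap(key: str, presses: int) -> str:
    """Return the kana selected by pressing `key` `presses` times."""
    kana = KEYPAD[key]
    return kana[(presses - 1) % len(kana)]


print(romaji_to_kana("susi"), romaji_to_kana("sushi"))
print(multitap("2", 3))
```

Both spellings of the same word yield the same kana, which is the "loose combination" of romanization systems that keyboard romaji allows.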


Direction of text

Japanese can be written in two directions. Yokogaki style writes left-to-right, top-to-bottom, as with English. Tategaki style writes first top-to-bottom, and then moves right-to-left.

To compete with Ichitaro, Microsoft provided several updates for early Japanese versions of Microsoft Word including support for downward text, such as Word 5.0 Power Up Kit and Word 98. QuarkXPress was the most popular DTP software in Japan in the 1990s, even though it had a long development cycle. However, because it lacked support for downward text, it was surpassed by Adobe InDesign, which gained strong support for downward text through several updates.

At present, handling of downward text is incomplete. For example, HTML itself has no support for tategaki, and Japanese users must use HTML tables to simulate it. However, CSS level 3 includes a property "writing-mode" which can render tategaki when given the value "vertical-rl" (i.e. top to bottom, right to left). Word processors and DTP software have more complete support for it.
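
The tategaki ordering (top to bottom within a column, columns running right to left) can be made concrete with a small Python sketch that lays text out on a character grid, much as a table-based simulation would. The `tategaki` function and its fixed column height are hypothetical; real vertical typesetting also handles punctuation, rotated glyphs and line-breaking rules, which this ignores.

```python
def tategaki(text: str, height: int) -> list[str]:
    """Arrange `text` in columns of `height` characters.

    Each column reads top to bottom; the first column is placed at the
    right edge and later columns follow to its left.
    """
    columns = [text[i:i + height] for i in range(0, len(text), height)]
    # Pad the last column with full-width spaces so all rows align.
    columns = [col.ljust(height, "\u3000") for col in columns]
    return [
        "".join(col[row] for col in reversed(columns))
        for row in range(height)
    ]


for line in tategaki("あいうえおか", 3):
    print(line)
```

Reading the printed grid down the rightmost column and then down the column to its left recovers the original character order.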


See also

* Japanese writing system
* Japanese language
* CJK characters
* Korean language and computers
* Vietnamese language and computers



External links


* Japanese Owned computer companies in United States
* http://examples.oreilly.com/cjkvinfo/doc/cjk.inf Chinese, Japanese, and Korean character set standards and encoding systems from 1996
* Japanese text encoding
* Online Japanese Dictionary of Linguistics
* Online Japanese Dictionary