
Shift JIS (also SJIS, MIME name Shift_JIS, known as PCK in Solaris contexts) is a character encoding for the Japanese language, originally developed by the Japanese company ASCII Corporation in conjunction with Microsoft and standardized as JIS X 0208 Appendix 1. Shift JIS is based on character sets defined within the JIS standards JIS X 0201 (for the single-byte characters) and JIS X 0208 (for the double-byte characters). Less than 0.05% of surveyed web pages used Shift JIS (in practice decoded as its superset, Windows-31J), a decline from 1.3% in July 2014. Shift JIS is the third-most declared character encoding for Japanese websites, declared by 1.0% of sites in the .jp domain, while UTF-8 is used by 99% of Japanese websites. Because the declaration is in effect decoded as Windows-31J, it is that superset which is the third-most popular encoding in practice. Shift JIS is also sometimes used in QR codes, a Japanese invention; QR codes also allow UTF-8, which may be preferable.


Structure

Shift JIS is an extension of the single-byte encoding JIS X 0201 that uses unassigned code points in JIS X 0201 to encode the double-byte character set. The lead bytes for the double-byte characters are "shifted" around the 63 halfwidth katakana characters in the single-byte range 0xA1 to 0xDF.

The single-byte characters 0x00 to 0x7F match the ASCII encoding, except for a yen sign (U+00A5) at 0x5C and an overline (U+203E) at 0x7E in place of the ASCII character set's backslash and tilde respectively (these deviations from ASCII align with JIS X 0201). The single-byte characters from 0xA1 to 0xDF map to the halfwidth katakana characters found in JIS X 0201.

For double-byte characters, the first byte is always in the range 0x81 to 0x9F or the range 0xE0 to 0xEF (these ranges are unassigned in JIS X 0201). Each first byte covers two rows of JIS X 0208: a character from an odd-numbered row has its second byte in the range 0x40 to 0x9E (but never 0x7F), and a character from an even-numbered row has its second byte in the range 0x9F to 0xFC.

Shift JIS only guarantees that the first byte of a two-byte character has its high bit set (0x80–0xFF); the second byte can be either high or low. The appearance of byte values 0x40–0x7E as second bytes makes reliable Shift JIS detection difficult, because the same values are used for ASCII characters. Because the same byte value can be either a first or a second byte, string searches are also difficult: a simple byte search can match the second byte of one character together with the first byte of the next, which is not a character that actually occurs in the text. String-searching algorithms must therefore be tailor-made for Shift JIS.
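The byte rules above can be sketched in a few lines of Python. This is a minimal illustration rather than a complete decoder: the function names `is_lead_byte`, `split_chars` and `find_aligned` are hypothetical, only the standard lead-byte ranges 0x81 to 0x9F and 0xE0 to 0xEF are recognised, and second bytes are not validated.

```python
def is_lead_byte(b: int) -> bool:
    """True if b opens a double-byte character in standard Shift JIS."""
    return 0x81 <= b <= 0x9F or 0xE0 <= b <= 0xEF


def split_chars(data: bytes) -> list[bytes]:
    """Split a Shift JIS byte string into one bytes object per character."""
    chars = []
    i = 0
    while i < len(data):
        width = 2 if is_lead_byte(data[i]) else 1
        if i + width > len(data):
            raise ValueError(f"truncated double-byte character at offset {i}")
        chars.append(data[i:i + width])
        i += width
    return chars


def find_aligned(haystack: bytes, needle: bytes) -> int:
    """Like bytes.find, but only reports matches that start on a character boundary."""
    i = 0
    while i + len(needle) <= len(haystack):
        if haystack.startswith(needle, i):
            return i
        # Step over a whole character so that a second byte is never
        # mistaken for the start of a match.
        i += 2 if is_lead_byte(haystack[i]) else 1
    return -1


# "メア" encodes as 83 81 83 41; the bytes 81 83 straddle the two characters.
text = "メア".encode("shift_jis")
straddling = bytes([0x81, 0x83])
naive_hit = text.find(straddling)            # matches across a boundary
aligned_hit = find_aligned(text, straddling)  # no match on a boundary
```

A plain `bytes.find` reports a hit in the middle of the first character here, while the boundary-aware search does not, which is the false-match problem described above.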


Compatibility

Shift JIS is fully backwards compatible with the JIS X 0201 single-byte encoding, meaning that any valid JIS X 0201 string is also a valid Shift JIS string. Double-byte characters in JIS X 0208 need to be transformed in order to be encoded in Shift JIS. For a double-byte JIS X 0208 sequence j_1 j_2, the transformation to the corresponding Shift JIS bytes s_1 s_2 is:

$$
s_1 = \begin{cases}
\left\lfloor \frac{j_1 + 1}{2} \right\rfloor + 112 & \text{if } 33 \le j_1 \le 94 \\
\left\lfloor \frac{j_1 + 1}{2} \right\rfloor + 176 & \text{if } 95 \le j_1 \le 126
\end{cases}
$$

$$
s_2 = \begin{cases}
j_2 + 31 + \left\lfloor \frac{j_2}{96} \right\rfloor & \text{if } j_1 \text{ is odd} \\
j_2 + 126 & \text{if } j_1 \text{ is even}
\end{cases}
$$

The competing 8-bit format EUC-JP, which does not support single-byte halfwidth katakana, allows for a cleaner and more direct conversion to and from JIS X 0208 code points, as all high-bit-set bytes are parts of a double-byte character and all codes from the ASCII range represent single-byte characters.
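As a worked example of the transformation above, the following Python sketch converts one JIS X 0208 byte pair to Shift JIS. The function name `jis_to_sjis` is hypothetical, and the floor terms are taken to be ⌊(j_1 + 1)/2⌋ and ⌊j_2/96⌋, which reproduces the lead-byte ranges 0x81 to 0x9F and 0xE0 to 0xEF given earlier.

```python
def jis_to_sjis(j1: int, j2: int) -> tuple[int, int]:
    """Map a two-byte JIS X 0208 code (each byte 0x21..0x7E) to Shift JIS bytes."""
    if not (0x21 <= j1 <= 0x7E and 0x21 <= j2 <= 0x7E):
        raise ValueError("JIS X 0208 bytes must both be in 0x21..0x7E")

    # Two JIS rows share one Shift JIS lead byte; the offset jumps from
    # 112 to 176 to skip the halfwidth katakana range 0xA1..0xDF.
    s1 = (j1 + 1) // 2 + (112 if j1 <= 94 else 176)

    if j1 % 2 == 1:
        # Odd rows land in 0x40..0x9E; j2 // 96 adds 1 from j2 = 0x60
        # upward so that the value 0x7F is skipped.
        s2 = j2 + 31 + j2 // 96
    else:
        # Even rows land in 0x9F..0xFC.
        s2 = j2 + 126
    return s1, s2


# JIS 0x3021 is the kanji 亜; it should come out as Shift JIS 0x889F.
sjis_a = bytes(jis_to_sjis(0x30, 0x21))
```

Decoding `sjis_a` with Python's built-in `shift_jis` codec gives 亜 back, which is a convenient spot check for the arithmetic.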


Usage

HTML written in Shift JIS can still be interpreted to some extent when incorrectly tagged as ASCII, and a charset declaration at the top of the document itself can still be read, since the bytes that delimit HTML tags and fields (<, >, /, ", &, ;) are encoded as in ASCII, and those byte values do not appear in two-byte sequences.

Shift JIS can be used in string literals in programming languages such as C, but a few things must be taken into consideration. Firstly, the escape character 0x5C, normally the backslash, is the halfwidth yen sign (¥) in Shift JIS. A programmer who is aware of this can write printf("ハローワールド¥n"); (where ハローワールド is "Hello, world" and ¥n is an escape sequence), assuming the I/O system supports Shift JIS output. Secondly, the 0x5C byte causes problems when it appears as the second byte of a two-byte character: the compiler interprets it as the start of an escape sequence and corrupts the string, unless it is followed by another 0x5C.
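To see which characters are affected by the second problem, one can check the encoded bytes directly. The sketch below uses Python's built-in `shift_jis` codec; the helper name `has_5c_trail_byte` and the sample characters are chosen purely for illustration.

```python
def has_5c_trail_byte(ch: str) -> bool:
    """True if ch is a double-byte Shift JIS character whose second byte is 0x5C."""
    encoded = ch.encode("shift_jis")
    return len(encoded) == 2 and encoded[1] == 0x5C


# ソ, 表 and 能 are common characters that end in 0x5C; ア is not affected.
sample = "ソ表能ア"
risky = [ch for ch in sample if has_5c_trail_byte(ch)]
```

Any character in `risky` would need its second byte followed by an extra 0x5C before being placed in a C string literal.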


Multiple versions

Many different versions of Shift JIS exist. There are two areas for expansion. Firstly, JIS X 0208 does not fill the whole 94×94 space encoded for it in Shift JIS, so there is room for more characters here; these are really extensions to JIS X 0208 rather than to Shift JIS itself. Secondly, Shift JIS has more encoding space than is needed for JIS X 0201 and JIS X 0208 (see § Shift JIS byte map below), and this space can be, and is, used for yet more characters (as either single-byte or double-byte characters).


Windows-932 / Windows-31J

The most popular extension is Windows code page 932 (a CCSID also used for IBM's extension to Shift JIS), which is registered with the IANA as "Windows-31J", separately from Shift JIS. This was popularized by Microsoft, although Microsoft itself does not recognize the Windows-31J name and instead calls that variation "shift_jis". IBM's code page 943 includes the same double-byte codes as Microsoft's code page 932, while IBM's code page 932 includes fewer extensions (excluding those which Microsoft incorporates from NEC), and retains the character order from the 1978 edition of JIS X 0208, rather than implementing the character variant swaps from the 1983 standard.

Windows-31J assigns 0x5C to U+005C REVERSE SOLIDUS (the backslash), and 0x7E to U+007E TILDE, following US-ASCII. However, most localised fonts on Windows display U+005C as a yen sign for compatibility. Windows-31J includes several extensions, namely "NEC special characters (Row 13), NEC selection of IBM extensions (Rows 89 to 92), and IBM extensions (Rows 115 to 119)", in addition to setting some encoding space aside for end user definition.

Windows code page 932 is the version used in the W3C/WHATWG encoding standard used by HTML5, which includes the "formerly proprietary extensions from IBM and NEC" from Windows-31J in its table for JIS X 0208, and also treats the label "shift_jis" interchangeably with "windows-31j" with the intent of being "compatible with deployed content".
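The practical difference between the two labels can be observed with Python's built-in codecs, where `shift_jis` follows the JIS X 0208 repertoire and `cp932` is the Windows code page. The following is a small sketch; the helper `try_decode` is hypothetical, and the behaviour shown is that of CPython's codecs rather than of any browser.

```python
from typing import Optional


def try_decode(data: bytes, codec: str) -> Optional[str]:
    """Decode data with the named codec, returning None if it is rejected."""
    try:
        return data.decode(codec)
    except UnicodeDecodeError:
        return None


# 0x8740 lies in row 13, which holds the NEC special characters in code page 932.
nec_circled_one = bytes([0x87, 0x40])

as_cp932 = try_decode(nec_circled_one, "cp932")          # the circled digit one
as_shift_jis = try_decode(nec_circled_one, "shift_jis")  # rejected: row 13 is unassigned

# Windows-31J maps the single byte 0x5C to U+005C, as described above.
backslash = bytes([0x5C]).decode("cp932")
```

Content that uses the vendor extensions therefore decodes only when the wider code page is assumed, which is consistent with web standards treating the two labels as one.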


MacJapanese

The version of Shift JIS originating from the classic Mac OS (known as x-mac-japanese, Code page 10001 or MacJapanese) assigned the tilde to 0x7E (following US-ASCII, not JIS X 0201, which assigns the overline here), but the yen sign to 0x5C (as in JIS X 0201 and standard Shift JIS). It also extended JIS X 0201 by assigning the backslash to 0x80 (corresponding to 0x5C in US-ASCII), the non-breaking space to 0xA0, the copyright sign to 0xFD, the trademark symbol to 0xFE and the halfwidth horizontal ellipsis to 0xFF. It also added extended double-byte characters, including 53 vertical presentation forms in the range 0xEB41–0xED96, at 84 JIS rows down from their canonical forms, and 260 special characters in the Shift_JIS range 0x8540–0x886D. This variant was introduced in KanjiTalk version 7.

However, certain Mac OS typefaces used other variants. Sai Mincho and Chu Gothic use a "PostScript" variant of MacJapanese, which included additional vertical presentation forms and a different set of extended special characters, based on the NEC special characters, some of which were only available in the printer versions of the fonts. Older versions of Maru Gothic and Hon Mincho from System 7.1 encoded vertical presentation forms at 10 (not 84) JIS rows down from their canonical forms, and did not include the special character extensions; this was subsequently changed. The typical variant used with KanjiTalk version 6 placed the vertical presentation forms 10 rows down, and also used the NEC extension layout for row 13.


Shift_JISx0213 and Shift_JIS-2004

The newer JIS X 0213 standard defines an extended variant of Shift_JIS referred to as Shift_JISx0213 (in a previous version of the standard) or Shift_JIS-2004. It is a superset of standard Shift JIS. In order to represent the allocated rows on both planes of JIS X 0213, Shift_JIS-2004 uses the following method of mapping code points.

$$
s_1 = \begin{cases}
\left\lfloor \frac{k + 257}{2} \right\rfloor & \text{if } m = 1 \text{ and } 1 \le k \le 62 \\
\left\lfloor \frac{k + 385}{2} \right\rfloor & \text{if } m = 1 \text{ and } 63 \le k \le 94 \\
\left\lfloor \frac{k + 479}{2} \right\rfloor - \left\lfloor \frac{k}{8} \right\rfloor \times 3 & \text{if } m = 2 \text{ and } k = 1, 3, 4, 5, 8, 12, 13, 14, 15 \\
\left\lfloor \frac{k + 411}{2} \right\rfloor & \text{if } m = 2 \text{ and } 78 \le k \le 94
\end{cases}
$$

$$
s_2 = \begin{cases}
t + 63 & \text{if } k \text{ is odd and } 1 \le t \le 63 \\
t + 64 & \text{if } k \text{ is odd and } 64 \le t \le 94 \\
t + 158 & \text{if } k \text{ is even}
\end{cases}
$$

In the above, s_1 s_2 is a two-byte Shift_JIS-2004 sequence, m is the plane (''men'') number (1 or 2), k is the row (''ku'') number (1–94) and t is the cell (''ten'') number (1–94). The ''ku'' and ''ten'' numbers are equivalent to j_1 - 32 and j_2 - 32 respectively, where j_1 j_2 is a two-byte JIS sequence referencing a given plane. The same set of characters can be represented by EUC-JIS-2004, the EUC-JP based counterpart.

Some of the additions collide with popular Shift JIS extensions, including Windows code page 932, which is used in web standards (see above). For example, compare plane 1 row 89 in JIS X 0213 (beginning 硃, 硎, 硏...) to row 89 in the JIS X 0208 variant defined in web standards (beginning 纊, 褜, 鍈...). In addition, some of the characters map to Unicode characters beyond the BMP.
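The Shift_JIS-2004 mapping can be written out as a short Python function. This is a sketch under stated assumptions: the name `kuten_to_sjis2004` is hypothetical, and the floor terms are taken as ⌊(k + 257)/2⌋, ⌊(k + 385)/2⌋, ⌊(k + 479)/2⌋ − ⌊k/8⌋ × 3 and ⌊(k + 411)/2⌋, which should be checked against the standard before being relied on. With these values plane 1 lands on lead bytes 0x81 to 0x9F and 0xE0 to 0xEF, and plane 2 on 0xF0 to 0xFC.

```python
PLANE2_LOW_ROWS = {1, 3, 4, 5, 8, 12, 13, 14, 15}


def kuten_to_sjis2004(m: int, k: int, t: int) -> tuple[int, int]:
    """Map plane m, row k (ku), cell t (ten) to a Shift_JIS-2004 byte pair."""
    if not 1 <= t <= 94:
        raise ValueError("ten must be in 1..94")

    if m == 1 and 1 <= k <= 62:
        s1 = (k + 257) // 2
    elif m == 1 and 63 <= k <= 94:
        s1 = (k + 385) // 2
    elif m == 2 and k in PLANE2_LOW_ROWS:
        # The scattered low rows of plane 2 are packed into 0xF0..0xF4.
        s1 = (k + 479) // 2 - (k // 8) * 3
    elif m == 2 and 78 <= k <= 94:
        s1 = (k + 411) // 2
    else:
        raise ValueError("row not allocated for this plane")

    if k % 2 == 1:
        # Odd rows: 0x40..0x7E, then 0x80..0x9E (0x7F is skipped).
        s2 = t + 63 if t <= 63 else t + 64
    else:
        # Even rows: 0x9F..0xFC.
        s2 = t + 158
    return s1, s2


# Plane 1, row 16, cell 1 is the kanji 亜, as in plain Shift JIS.
sjis2004_a = bytes(kuten_to_sjis2004(1, 16, 1))
```

For plane 1 the result agrees with standard Shift JIS, in line with the superset relationship described above; only plane 2 uses the additional lead bytes.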


Other variants

The space with lead bytes 0xF5 to 0xF9 (beyond the region used for JIS X 0208) is used by Japanese mobile phone operators for pictographs for use in e-mail. KDDI goes further and defines hundreds more in the space with lead bytes 0xF3 and 0xF4.

Beyond even this, there have been numerous minor variations made on Shift JIS, with individual characters here and there altered. Most of these extensions and variants have no IANA registration, so there is much scope for confusion if the extensions are used.

A further variant is needed when Shift JIS is embedded in source code strings of C and similar programming languages. This variant doubles the byte 0x5C if it appears as the second byte of a two-byte character, but not if it appears as a single "¥" (ASCII: "\") character, because 0x5C is the beginning of an escape sequence. The most practical way of handling this is a special editor which encodes text this way.
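The doubling rule for source-code strings could be implemented along the following lines. This is a minimal sketch, not a description of any particular editor: the function names are hypothetical, and only the standard lead-byte ranges 0x81 to 0x9F and 0xE0 to 0xEF are treated as starting a two-byte character.

```python
def is_lead_byte(b: int) -> bool:
    """True if b opens a double-byte character in standard Shift JIS."""
    return 0x81 <= b <= 0x9F or 0xE0 <= b <= 0xEF


def double_trail_backslashes(data: bytes) -> bytes:
    """Double 0x5C where it is the second byte of a two-byte character."""
    out = bytearray()
    i = 0
    while i < len(data):
        b = data[i]
        if is_lead_byte(b) and i + 1 < len(data):
            trail = data[i + 1]
            out += bytes([b, trail])
            if trail == 0x5C:
                # The compiler will consume one 0x5C as an escape character,
                # leaving the original second byte intact.
                out.append(0x5C)
            i += 2
        else:
            # Single-byte characters, including a lone 0x5C that starts a
            # real escape sequence, are copied unchanged.
            out.append(b)
            i += 1
    return bytes(out)


# ソ (83 5C) followed by the escape sequence ¥n (5C 6E).
source = bytes([0x83, 0x5C, 0x5C, 0x6E])
escaped = double_trail_backslashes(source)
```

Here only the 0x5C inside ソ is doubled; the 0x5C that begins the newline escape is left alone, matching the rule stated above.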


Shift JIS byte map


As defined in JIS X 0208:1997

The chart below gives the detailed meaning of each byte in a stream encoded in standard Shift JIS (conforming to JIS X 0208:1997).


With vendor or JIS X 0213 extensions

Some of the bytes which are not used for single-byte codes or initial bytes in JIS X 0208:1997 are used by certain extensions, resulting in the layout detailed in the chart below.


See also

* Japanese language and computers
* Code page 932 (Microsoft Windows)
* Mojibake
* Shift JIS art



External links


* Shift-JIS Kanji Table: a table of the non-ASCII part of the codeset
* Microsoft's definition
* Forms of Shift-JIS in ICU (International Components for Unicode):
  * ibm-942 (sjis78)
  * ibm-943 (contains the \u00A5 ↔ \x5C mapping)
  * Shift JIS (contains the \u005C ↔ \x5C mapping)