UTF-EBCDIC
   HOME

TheInfoList



OR:

UTF-EBCDIC is a
character encoding Character encoding is the process of assigning numbers to graphical characters, especially the written characters of human language, allowing them to be stored, transmitted, and transformed using digital computers. The numerical values tha ...
capable of encoding all 1,112,064 valid character
code point In character encoding terminology, a code point, codepoint or code position is a numerical value that maps to a specific character. Code points usually represent a single grapheme—usually a letter, digit, punctuation mark, or whitespace—but ...
s in
Unicode Unicode, formally The Unicode Standard,The formal version reference is is an information technology standard for the consistent encoding, representation, and handling of text expressed in most of the world's writing systems. The standard, ...
using one to five one-
byte The byte is a unit of digital information that most commonly consists of eight bits. Historically, the byte was the number of bits used to encode a single character of text in a computer and for this reason it is the smallest addressable uni ...
(8-bit) code units (in contrast to a maximum of four for
UTF-8 UTF-8 is a variable-length character encoding used for electronic communication. Defined by the Unicode Standard, the name is derived from ''Unicode'' (or ''Universal Coded Character Set'') ''Transformation Format 8-bit''. UTF-8 is capable of e ...
). It is meant to be
EBCDIC Extended Binary Coded Decimal Interchange Code (EBCDIC; ) is an eight- bit character encoding used mainly on IBM mainframe and IBM midrange computer operating systems. It descended from the code used with punched cards and the corresponding ...
-friendly, so that legacy EBCDIC applications on
mainframes A mainframe computer, informally called a mainframe or big iron, is a computer used primarily by large organizations for critical applications like bulk data processing for tasks such as censuses, industry and consumer statistics, enterprise ...
may process the characters without much difficulty. Its advantages for existing EBCDIC-based systems are similar to
UTF-8 UTF-8 is a variable-length character encoding used for electronic communication. Defined by the Unicode Standard, the name is derived from ''Unicode'' (or ''Universal Coded Character Set'') ''Transformation Format 8-bit''. UTF-8 is capable of e ...
's advantages for existing
ASCII ASCII ( ), abbreviated from American Standard Code for Information Interchange, is a character encoding standard for electronic communication. ASCII codes represent text in computers, telecommunications equipment, and other devices. Because ...
-based systems. Details on UTF-EBCDIC are defined in Unicode Technical Report #16. To produce the UTF-EBCDIC encoded version of a series of Unicode code points, an encoding based on UTF-8 (known in the specification as UTF-8-Mod) is applied first (creating what the specification calls an I8 sequence). The main difference between this encoding and UTF-8 is that it allows Unicode code points U+0080 through U+009F (the
C1 control code The C0 and C1 control code or control character sets define control codes for use in text by computer systems that use ASCII and derivatives of ASCII. The codes represent additional information about the text, such as the position of a cursor, ...
s) to be represented as a single byte and therefore later mapped to corresponding EBCDIC control codes. In order to achieve this, UTF-8-Mod uses 101XXXXX instead of 10XXXXXX as the format for trailing bytes in a multi-byte sequence. As this can only hold 5 bits rather than 6, the UTF-8-Mod encoding of codepoints above U+03FF are larger than the UTF-8 encoding. The UTF-8-Mod transformation leaves the data in an ASCII-based format (for example, U+0041 "A" is still encoded as 01000001), so each byte is fed through a reversible (one-to-one) lookup table to produce the final UTF-EBCDIC encoding. For example, 01000001 in this table maps to 11000001; thus the UTF-EBCDIC encoding of U+0041 (Unicode's "A") is 0xC1 (EBCDIC's "A"). This encoding form is rarely used, even on the EBCDIC-based mainframes for which it was designed. IBM EBCDIC-based mainframe operating systems, such as
z/OS z/OS is a 64-bit operating system for IBM z/Architecture mainframes, introduced by IBM in October 2000. It derives from and is the successor to OS/390, which in turn was preceded by a string of MVS versions.Starting with the earliest: * ...
, usually use
UTF-16 UTF-16 (16-bit Unicode Transformation Format) is a character encoding capable of encoding all 1,112,064 valid code points of Unicode (in fact this number of code points is dictated by the design of UTF-16). The encoding is variable-length, as cod ...
for complete Unicode support. For example,
IBM Db2 Db2 is a family of data management products, including database servers, developed by IBM. It initially supported the relational model, but was extended to support object–relational features and non-relational structures like JSON a ...
,
COBOL COBOL (; an acronym for "common business-oriented language") is a compiled English-like computer programming language designed for business use. It is an imperative, procedural and, since 2002, object-oriented language. COBOL is primarily u ...
,
PL/I PL/I (Programming Language One, pronounced and sometimes written PL/1) is a procedural, imperative computer programming language developed and published by IBM. It is designed for scientific, engineering, business and system programming. I ...
,
Java Java (; id, Jawa, ; jv, ꦗꦮ; su, ) is one of the Greater Sunda Islands in Indonesia. It is bordered by the Indian Ocean to the south and the Java Sea to the north. With a population of 151.6 million people, Java is the world's mo ...
and the IBM
XML Extensible Markup Language (XML) is a markup language and file format for storing, transmitting, and reconstructing arbitrary data. It defines a set of rules for encoding documents in a format that is both human-readable and machine-readable. T ...
toolkit support UTF-16 on IBM mainframes.


Codepage layout

There are 160 characters with single-byte encodings in UTF-EBCDIC (compared to 128 in UTF-8). As can be seen, the single-byte portion is similar to IBM-1047 instead of IBM-37 due to the location of the square brackets.
CCSID A CCSID (coded character set identifier) is a 16-bit number that represents a particular encoding of a specific code page. For example, Unicode is a code page that has several encoding (so called "transformation") forms, like UTF-8, UTF-16 and U ...
37 has [] at hex BA and BB instead of at hex AD and BD respectively.


Oracle UTFE

''Oracle UTFE'' is a Unicode 3.0 UTF-8
Oracle database Oracle Database (commonly referred to as Oracle DBMS, Oracle Autonomous Database, or simply as Oracle) is a multi-model database management system produced and marketed by Oracle Corporation. It is a database commonly used for running online ...
variation, similar to the
CESU-8 The Compatibility Encoding Scheme for UTF-16: 8-Bit (CESU-8) is a variant of UTF-8 that is described in Unicode Technical Report #26. A Unicode code point from the Basic Multilingual Plane (BMP), i.e. a code point in the range U+0000 to U+FFFF, ...
variant of UTF-8, where supplementary characters are encoded as two 4-byte characters rather than a single 4- or 5-byte character. It is used only on EBCDIC platforms. Advantages: * Only Unicode character set for EBCDIC. * Length of SQL CHAR types can be specified in number of characters. * Binary order of the SQL CHAR columns is same as binary order of the SQL NCHAR columns if the data consists of same supplementary characters. Consequently, these columns sort the same for identical strings. Disadvantages: * Supplementary characters occupy six bytes instead of four bytes only. Consequently, supplementary characters need to be converted. * UTFE is not a Unicode standard encoding. Clients requiring UTF-8 encoding must convert data on retrieval and storage.


Mksh nega-UTF-8

Mksh KornShell (ksh) is a Unix shell which was developed by David Korn at Bell Labs in the early 1980s and announced at USENIX on July 14, 1983. The initial development was based on Bourne shell source code. Other early contributors were Bell L ...
is a POSIX shell with a “utf8-mode” that complicates porting to EBCDIC. Instead of implementing UTF-EBCDIC (which is “not used by anyone”), the developers found it easier to just use normal UTF-8 and convert it to EBCDIC by pretending it’s the “extended ASCII” encoding for the current code page. Doing so destroys EBCDIC control characters not found in ASCII, however.


See also

*
UTF-1 UTF-1 is a method of transforming ISO/IEC 10646/Unicode into a stream of bytes. Its design does not provide self-synchronization, which makes searching for substrings and error recovery difficult. It reuses the ASCII printing characters for mult ...
*
UTF-8 UTF-8 is a variable-length character encoding used for electronic communication. Defined by the Unicode Standard, the name is derived from ''Unicode'' (or ''Universal Coded Character Set'') ''Transformation Format 8-bit''. UTF-8 is capable of e ...
* BOCU-1


References


External links

* V.S. Umamaheswaran
Unicode Technical Report #16: the definition of UTF-EBCDIC
(2002-04-16) {{DEFAULTSORT:Utf-Ebcdic Character encoding Unicode Transformation Formats