The MARC-8 charset is a MARC standard used in MARC-21 library records. The MARC formats are standards for the representation and communication of bibliographic and related information in machine-readable form, and they are frequently used in library database systems. The character encoding now known as MARC-8 was introduced in 1968 as part of the MARC format. Originally based on the Latin alphabet, its repertoire was expanded from 1979 to 1983 by the JACKPHY initiative to include Japanese, Arabic, Chinese, and Hebrew characters (among others), with the later addition of Cyrillic and Greek scripts. If a character in a MARC-21 record cannot be represented in MARC-8, then UTF-8 must be used instead. UTF-8 supports many more characters than MARC-8, which is rarely used outside library data.
Technical details
MARC-8 uses a variant of the ISO-2022 encoding. It uses escape sequences to represent characters beyond the 7-bit ASCII range.
It generally uses the same logical BiDi ordering as Unicode.
The combining characters and base characters are in a different order than in Unicode: in MARC-8 a combining character precedes the base character it modifies, whereas in Unicode it follows. When a base character carries several combining characters, their order is not always simply the reverse of the order produced by Unicode normalization. The MARC-21 standard describes the MARC-8 Unicode conversion issues in more detail.
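The ordering difference can be illustrated with a minimal Python sketch. It assumes the MARC-8 bytes have already been mapped to Unicode code points but are still in MARC-8 order, with each combining mark preceding its base character. The function name `marks_after_base` is hypothetical, and keeping several marks in their original relative order is a simplifying assumption, not the conversion rule defined by the MARC-21 standard.

```python
import unicodedata


def marks_after_base(text: str) -> str:
    """Move each run of combining marks so it follows the next base character.

    Assumption: `text` holds Unicode code points in MARC-8 order
    (combining marks first, then the base character they modify).
    """
    out = []
    pending = []  # combining marks waiting for their base character
    for ch in text:
        if unicodedata.category(ch) == "Mn":  # nonspacing combining mark
            pending.append(ch)
        else:
            out.append(ch)
            out.extend(pending)  # relative order kept as-is (simplification)
            pending.clear()
    out.extend(pending)  # trailing marks with no base are kept unchanged
    return "".join(out)


# Combining acute accent (U+0301) stored before "e", as in MARC-8 order.
reordered = marks_after_base("\u0301e")
composed = unicodedata.normalize("NFC", reordered)  # "\u00e9"
```

After reordering, standard Unicode normalization can be applied to the result.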
Code structure
ISO/IEC 2022 specifies a two-layer mapping between character codes and displayed characters. In MARC-8, character codes from the 7-bit ASCII graphic range (0x20–0x7F) are referred to as "G0" codes, while codes from the "high ASCII" range (0xA0–0xFF) are referred to as "G1" codes. Graphic character sets are designated and invoked by means of a multi-byte escape sequence consisting of the escape character, an intermediate character sequence, and a final character, in the form ESC ''I'' ''F''.
The following table shows the intermediate byte after the ESC byte (hexadecimal 1B), and the corresponding ASCII characters.
The following table shows the final bytes in hexadecimal and the corresponding ASCII characters after the intermediate bytes.
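As a rough illustration of the ESC ''I'' ''F'' form, the following Python sketch parses one designation sequence. The intermediate and final byte values in the two dictionaries are recalled from the Library of Congress MARC-8 specification and are an assumption here; they should be checked against that specification before being relied upon. The function name `parse_designation` is hypothetical.

```python
ESC = 0x1B

# Intermediate bytes -> (graphic slot, bytes per character).
# Assumption: values recalled from the LoC specification; verify before use.
INTERMEDIATES = {
    b"(": ("G0", 1), b",": ("G0", 1),
    b")": ("G1", 1), b"-": ("G1", 1),
    b"$": ("G0", 3), b"$,": ("G0", 3),
    b"$)": ("G1", 3), b"$-": ("G1", 3),
}

# Final bytes -> character set name (same caveat as above).
FINALS = {
    b"3": "Basic Arabic",
    b"4": "Extended Arabic",
    b"B": "Basic Latin (ASCII)",
    b"!E": "Extended Latin (ANSEL)",
    b"1": "CJK (EACC)",
    b"N": "Basic Cyrillic",
    b"Q": "Extended Cyrillic",
    b"S": "Basic Greek",
    b"2": "Basic Hebrew",
}


def parse_designation(data: bytes, pos: int = 0):
    """Return (slot, width, set_name, next_pos) for the sequence at `pos`."""
    if pos >= len(data) or data[pos] != ESC:
        raise ValueError("no escape byte at this position")
    rest = data[pos + 1:]
    # Try longer intermediates first so that "$)" wins over "$".
    for inter in sorted(INTERMEDIATES, key=len, reverse=True):
        if not rest.startswith(inter):
            continue
        after = rest[len(inter):]
        for final in sorted(FINALS, key=len, reverse=True):
            if after.startswith(final):
                slot, width = INTERMEDIATES[inter]
                next_pos = pos + 1 + len(inter) + len(final)
                return slot, width, FINALS[final], next_pos
    raise ValueError("unrecognized designation sequence")


result = parse_designation(b"\x1b$1")  # ESC $ 1
```

A real decoder would also track which set is currently active in G0 and G1 and interpret the following bytes accordingly; this sketch stops at recognizing the designation.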
EACC (East Asian Character Code) is the only multibyte character set in MARC-8; it encodes each CJK character in three bytes from the 7-bit ASCII range.
For example, encoding the CJK character 人 (U+4EBA) requires the following bytes:
\x1B\x24\x31\x21\x30\x64
The sequence \x1B\x24\x31 (ESC $ 1) switches to EACC, and \x21\x30\x64 corresponds to U+4EBA.
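The example can be traced with a short Python sketch. It is only an illustration: the lookup table holds the single EACC mapping given above, whereas a real decoder would need the full EACC table, and the names `decode_eacc_run` and `EACC_TO_UNICODE` are hypothetical.

```python
EACC_DESIGNATION = b"\x1b\x24\x31"  # ESC $ 1, switches to EACC

# Assumption: only the one mapping from the example above is included.
EACC_TO_UNICODE = {b"\x21\x30\x64": "\u4eba"}


def decode_eacc_run(data: bytes) -> str:
    """Decode a byte string that starts with the EACC designation."""
    if not data.startswith(EACC_DESIGNATION):
        raise ValueError("data does not start with ESC $ 1")
    payload = data[len(EACC_DESIGNATION):]
    if len(payload) % 3 != 0:
        raise ValueError("EACC characters are three bytes each")
    chars = []
    for i in range(0, len(payload), 3):
        unit = payload[i:i + 3]
        # Unknown units become U+FFFD, the Unicode replacement character.
        chars.append(EACC_TO_UNICODE.get(unit, "\ufffd"))
    return "".join(chars)


decoded = decode_eacc_run(b"\x1b\x24\x31\x21\x30\x64")  # "\u4eba"
```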
Custom set extension
In addition to the ISO-2022 character sets, the following custom sets are available. The designating byte directly follows the escape byte (hexadecimal 1B); there is no intermediate byte.
C0 control codes
MARC 21 uses GS (0x1D) as a record terminator, RS (0x1E) as a field terminator, and US (0x1F) as a subfield delimiter.
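The role of these three control codes can be sketched in Python as follows. This is a toy illustration on a hand-made byte string: it ignores the leader and directory that a real MARC 21 record carries, and the function name `split_fields` is hypothetical.

```python
RECORD_TERMINATOR = b"\x1d"   # GS
FIELD_TERMINATOR = b"\x1e"    # RS
SUBFIELD_DELIMITER = b"\x1f"  # US


def split_fields(record: bytes) -> list:
    """Split a record body into fields, and each field into subfield chunks."""
    # Keep only what precedes the record terminator.
    body = record.split(RECORD_TERMINATOR, 1)[0]
    # Each field ends with a field terminator; drop the empty tail.
    fields = [f for f in body.split(FIELD_TERMINATOR) if f]
    # Within a field, each subfield starts with the subfield delimiter.
    return [f.split(SUBFIELD_DELIMITER) for f in fields]


sample = b"10\x1faTitle\x1fbSub\x1e  \x1faX\x1e\x1d"
fields = split_fields(sample)
```

In the result, the first chunk of each field holds whatever precedes the first subfield delimiter, and each later chunk begins with the subfield code.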
C1 control codes
The following alternative C1 control code set is defined for bibliographic applications such as library systems. It is mostly concerned with string collation and with markup of bibliographic fields. Slightly different variants are defined in the German standard DIN 31626 (published in 1978 and since withdrawn) and the ISO standard ISO 6630, the latter of which has also been adopted in Germany as DIN ISO 6630.
Differences between the variants are noted in the table below. MARC-8 uses the coding of NSB and NSE (the non-sorting begin and end codes) from this set, and adds some additional format effectors in locations not used by the ISO version; however, MARC 21 uses this control set only in MARC-8 records, not in Unicode-format records.
If using the ISO/IEC 2022 extension mechanism, the DIN 31626 set is designated as the active C1 control character set with the sequence 0x1B 0x22 0x45 (ESC " E), and the ISO 6630 / DIN ISO 6630 set is designated with the sequence 0x1B 0x22 0x42 (ESC " B). The 1985 expansion of the ISO 6630 set can also be explicitly specified by using the sequence 0x1B 0x26 0x40 0x1B 0x22 0x42 (ESC & @ ESC " B).
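The three designation sequences above can be recognized with a small Python sketch. The function name `identify_c1_set` and the label strings are hypothetical; only the byte sequences come from the text.

```python
# Longest sequence first, so the 1985 expansion is not mistaken for
# a bare "ESC & @" followed by an ordinary ISO 6630 designation.
C1_DESIGNATIONS = [
    (bytes([0x1B, 0x26, 0x40, 0x1B, 0x22, 0x42]), "ISO 6630 (1985 expansion)"),
    (bytes([0x1B, 0x22, 0x45]), "DIN 31626"),
    (bytes([0x1B, 0x22, 0x42]), "ISO 6630 / DIN ISO 6630"),
]


def identify_c1_set(data: bytes, pos: int = 0):
    """Return (set_name, next_pos) if a known C1 designation starts at `pos`."""
    for sequence, name in C1_DESIGNATIONS:
        if data.startswith(sequence, pos):
            return name, pos + len(sequence)
    return None  # no known C1 designation at this position


found = identify_c1_set(b"\x1b\x22\x45")  # ("DIN 31626", 3)
```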
External links
MARC 21 Specifications for Record Structure, Character Sets, and Exchange Media: the official MARC-8 standard, as maintained by the US Library of Congress