Binary Ordered Compression for Unicode (BOCU) is a MIME-compatible Unicode compression scheme. BOCU-1 combines the wide applicability of UTF-8 with the compactness of Standard Compression Scheme for Unicode (SCSU). This Unicode encoding is designed to be useful for compressing short strings, and maintains code point order. BOCU-1 is specified in a Unicode Technical Note.
For comparison, SCSU was adopted as a standard Unicode compression scheme with a byte/code point ratio similar to language-specific code pages. SCSU has not been widely adopted, as it is not suitable for MIME "text" media types. For example, SCSU cannot be used directly in emails and similar protocols. SCSU also requires a complicated encoder design for good performance. For larger amounts of Unicode text, zip, bzip2, and other industry-standard algorithms usually compact the data more efficiently.
Both SCSU and BOCU-1 are IANA-registered charsets.
Details
All numbers in this section are hexadecimal, and all ranges are inclusive.
Code points from U+0000 to U+0020 are encoded in BOCU-1 as the corresponding byte value. All other code points (that is, U+0021 through U+D7FF and U+E000 through U+10FFFF) are encoded as a difference between the code point and a normalized version of the most recently encoded code point that was not an ASCII space (U+0020). The initial state is U+0040. The normalization mapping is as follows:
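A minimal Python sketch of this normalization step. It assumes the ranges given in the BOCU-1 specification (Unicode Technical Note #6): fixed base values for Hiragana, CJK Unified Ideographs and Hangul syllables, and the middle of the containing 128-code-point block for everything else. The function name `normalize_previous` is illustrative and not part of any library.

```python
def normalize_previous(cp: int) -> int:
    """Return the base value used for the next difference after encoding cp.

    Ranges are an assumption taken from UTN #6, not from the text above.
    """
    if 0x3040 <= cp <= 0x309F:      # Hiragana
        return 0x3070
    if 0x4E00 <= cp <= 0x9FA5:      # CJK Unified Ideographs
        return 0x7711
    if 0xAC00 <= cp <= 0xD7A3:      # Hangul syllables
        return 0xC1D1
    # All other code points: xx00..xx7F -> xx40, xx80..xxFF -> xxC0.
    return (cp & ~0x7F) + 0x40
```

Under these assumptions every ASCII code point from U+0021 to U+007F normalizes to U+0040, which is the same value as the initial state.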
The difference between the current code point and the normalized previous code point is encoded as follows:
Each byte range is lexicographically ordered with the following thirteen byte values excluded: 00 07 08 09 0A 0B 0C 0D 0E 0F 1A 1B 20. For example, the byte sequence FC 06 FF, coding for a difference of 1156B, is immediately followed by the byte sequence FC 10 01, coding for a difference of 1156C.
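The difference encoding can be sketched in Python as follows. The range boundaries and lead bytes used here (single bytes 50 to CF around a middle value of 90, two-byte leads from D0 and 25, three-byte leads from FB and 22, four-byte leads FE and 21) are assumptions taken from the BOCU-1 specification rather than from the text above, and the names `encode_difference` and `trail_byte` are illustrative. The sketch reproduces the FC 06 FF / FC 10 01 example.

```python
# Trail digits 0..19 map to the control-range bytes that are not excluded;
# digits 20..242 map to bytes 21..FF (digit + 13).
LOW_TRAIL_BYTES = bytes.fromhex("010203040506101112131415161718191C1D1E1F")
TRAIL_COUNT = 243  # 256 byte values minus the 13 excluded ones


def trail_byte(digit: int) -> int:
    return LOW_TRAIL_BYTES[digit] if digit < 20 else digit + 13


def encode_difference(diff: int) -> bytes:
    """Encode a signed code point difference (ranges assumed from UTN #6)."""
    if -0x40 <= diff <= 0x3F:
        return bytes([0x90 + diff])          # single byte 50..CF
    if diff > 0:
        if diff <= 0x2910:
            diff, lead, count = diff - 0x40, 0xD0, 1
        elif diff <= 0x2DD0B:
            diff, lead, count = diff - 0x2911, 0xFB, 2
        else:
            diff, lead, count = diff - 0x2DD0C, 0xFE, 3
    else:
        if diff >= -0x2911:
            diff, lead, count = diff + 0x40, 0x50, 1
        elif diff >= -0x2DD0C:
            diff, lead, count = diff + 0x2911, 0x25, 2
        else:
            diff, lead, count = diff + 0x2DD0C, 0x22, 3
    trail = []
    for _ in range(count):
        # divmod floors toward negative infinity, so negative differences
        # borrow from the lead byte and still yield digits in 0..242.
        diff, digit = divmod(diff, TRAIL_COUNT)
        trail.append(trail_byte(digit))
    return bytes([lead + diff] + trail[::-1])


print(encode_difference(0x1156B).hex(" "))   # fc 06 ff
print(encode_difference(0x1156C).hex(" "))   # fc 10 01
```

Because larger differences always produce lexicographically larger byte sequences, comparing encoded strings byte by byte gives the same order as comparing the code points.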
Any ASCII input U+0000 to U+007F excluding space U+0020 resets the encoder to U+0040. Because the line-end code points U+000D and U+000A fall in the range that is encoded ''as is'' (0D 0A), the encoder is in a known state at the beginning of each line. The corruption of a single byte therefore affects at most one line. For comparison, the corruption of a single byte in UTF-8 affects at most one code point, and for SCSU it can affect the entire document.
BOCU-1 offers similar robustness for input texts that lack such resetting code points, by means of the special reset code 0xFF. When a decoder finds this octet, it resets its state to U+0040, as for a line end. The use of 0xFF reset bytes is not recommended in the BOCU-1 specification, because it conflicts with other BOCU-1 design goals, notably the ''binary order''.
The optional use of a signature U+FEFF at the beginning of BOCU-1 encoded texts, i.e. the BOCU-1 byte sequence FB EE 28, changes the initial state U+0040 to U+FEC0. In other words, the signature cannot simply be stripped as in most other Unicode encoding schemes. Adding a reset byte after the signature (FB EE 28 FF) could avoid this effect, but the BOCU-1 specification does not recommend this practice.
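The state handling described above can be sketched by tracking only the encoder state, without producing any bytes. The helper names `state_after` and `normalize_previous` are hypothetical, and the script-specific normalization ranges are assumptions taken from the BOCU-1 specification. The 0xFF reset byte is not modeled, since it is a byte-level marker rather than a code point.

```python
INITIAL_STATE = 0x40


def normalize_previous(cp: int) -> int:
    # Ranges assumed from UTN #6; default is the middle of the 128-block.
    if 0x3040 <= cp <= 0x309F:
        return 0x3070
    if 0x4E00 <= cp <= 0x9FA5:
        return 0x7711
    if 0xAC00 <= cp <= 0xD7A3:
        return 0xC1D1
    return (cp & ~0x7F) + 0x40


def state_after(code_points, state: int = INITIAL_STATE) -> int:
    """Return the encoder state after processing the given code points."""
    for cp in code_points:
        if cp == 0x20:
            continue                      # space leaves the state unchanged
        if cp < 0x20:
            state = INITIAL_STATE         # controls, including CR and LF, reset
        else:
            state = normalize_previous(cp)
    return state


# A line end puts the encoder back into the initial state.
print(hex(state_after([0x65E5, 0x0D, 0x0A])))    # 0x40

# The signature U+FEFF moves the state to U+FEC0, so the difference encoded
# for a following "A" (U+0041) changes from 1 to -0xFE7F.
print(0x41 - state_after([]))                    # 1
print(hex(0x41 - state_after([0xFEFF])))         # -0xfe7f
```

The last two lines show why a decoder cannot simply drop the three signature bytes: the bytes that follow were computed against U+FEC0, not against U+0040.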
In theory, UTF-1 and UTF-8 could encode the original UCS-4 set with 31 bits up to 7FFFFFFF. BOCU-1 and UTF-16 can encode the modern Unicode set from U+0000 to U+10FFFF. Excluding the thirteen ''protected'' code points encoded as single octets, BOCU-1 can use 243 (256 − 13, in decimal) octet values in multi-byte encodings. A multi-byte sequence needs at most four bytes, consisting of a lead byte and one to three trail bytes. The trail bytes encode a remaining "modulo 243" (base 243) difference; the lead byte determines the number of trail bytes and an initial difference. Note that the reset byte 0xFF is not ''protected'' and can occur as a trail byte.
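The figure of 243 and the four-byte maximum can be checked with a little arithmetic. The sketch below assumes the lead-byte allocation of the BOCU-1 specification for the positive direction (64 single-byte values, then 43, 3 and 1 lead bytes for two-, three- and four-byte sequences); these counts are not stated in the text above.

```python
# The thirteen excluded ("protected") byte values listed in the Details section.
EXCLUDED = bytes.fromhex("00 07 08 09 0A 0B 0C 0D 0E 0F 1A 1B 20")
trail_values = [b for b in range(0x100) if b not in EXCLUDED]
base = len(trail_values)                       # 243

# Lead-byte counts per sequence length (assumed from UTN #6).
SINGLE, LEAD_2, LEAD_3, LEAD_4 = 64, 43, 3, 1

reach_1 = SINGLE - 1                           # largest one-byte difference
reach_2 = reach_1 + LEAD_2 * base              # largest two-byte difference
reach_3 = reach_2 + LEAD_3 * base ** 2         # largest three-byte difference
reach_4 = reach_3 + LEAD_4 * base ** 3         # largest four-byte difference

# Largest positive difference ever needed: U+10FFFF after a state of U+0040.
largest_needed = 0x10FFFF - 0x40

print(base, hex(reach_2), hex(reach_3))        # 243 0x2910 0x2dd0b
print(reach_3 < largest_needed <= reach_4)     # True
```

Three bytes do not reach far enough to cover the whole code space from the initial state, while a single four-byte lead value covers it with room to spare.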
Patent
Prior to 16 November 2022, the general BOCU algorithm was covered by United States Patent #6,737,994, which also mentions the specific BOCU-1 implementation. This patent has now expired. IBM, which employed both of the inventors of BOCU-1 at the time it was created, stated in the Unicode Technical Note that implementers of a "fully compliant version of BOCU-1" had to contact IBM to request a royalty-free license. BOCU-1 is the only Unicode compression scheme described on the Unicode Web site that is known to have been encumbered with intellectual property restrictions.
By contrast, IBM also filed for a patent on UTF-EBCDIC, but it chose in that case to make the documentation and encoding scheme "freely available to anyone concerned towards making the transformation format as part of the UCS standards", instead of requiring implementers to request a license.
See also
* UTF-1 contains a comparison of the UTF-1, UTF-8, and BOCU-1 designs
* International Components for Unicode: a library that can convert between BOCU-1 and other Unicode encodings