A binary-to-text encoding is an encoding of data in plain text. More precisely, it is an encoding of binary data in a sequence of printable characters. These encodings are necessary for transmission of data when the communication channel does not allow binary data (such as email or NNTP) or is not 8-bit clean. PGP documentation uses the term "ASCII armor" for binary-to-text encoding when referring to Base64.
Overview
The basic need for a binary-to-text encoding comes from a need to communicate arbitrary binary data over preexisting communications protocols that were designed to carry only English-language human-readable text. Those protocols may be only 7-bit safe (and within that range avoid certain ASCII control codes), may require line breaks at certain maximum intervals, and may not preserve whitespace. Thus, only the 94 printable ASCII characters (the printable set excluding the space) are "safe" to use to convey data.
Description
The ASCII text-encoding standard uses 7 bits to encode characters. With this it is possible to encode 128 (i.e. 2^7) unique values (0–127) to represent the alphabetic, numeric, and punctuation characters commonly used in English, plus a selection of control characters which do not represent printable characters. For example, the capital letter A is represented in 7 bits as 100 0001 in binary (0x41 in hexadecimal, 101 in octal), the numeral 2 is 011 0010 (0x32, octal 62), the character } is 111 1101 (0x7D, octal 175), and the control character RETURN is 000 1101 (0x0D, octal 15).
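The code values quoted above can be checked with a short Python sketch. The helper name `ascii_forms` is made up for this illustration and is not part of any standard.

```python
def ascii_forms(ch: str) -> tuple:
    """Return the 7-bit binary, hexadecimal, and octal forms of one ASCII character."""
    code = ord(ch)
    if code > 127:
        raise ValueError("not a 7-bit ASCII character")
    # "07b" pads the binary form to seven digits; "#04x" adds the 0x prefix.
    return format(code, "07b"), format(code, "#04x"), format(code, "o")

# The four examples from the text: a letter, a digit, punctuation, and RETURN.
for ch in ["A", "2", "}", "\r"]:
    print(repr(ch), *ascii_forms(ch))
```

Note that Python prints hexadecimal digits in lowercase, so RETURN appears as 0x0d rather than 0x0D.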
In contrast, most computers store data in memory organized in eight-bit bytes. Files that contain machine-executable code and non-textual data typically contain all 256 possible eight-bit byte values. Many computer programs came to rely on this distinction between seven-bit "text" and eight-bit "binary" data, and would not function properly if non-ASCII characters appeared in data that was expected to include only ASCII text. For example, if the value of the eighth bit is not preserved, the program might interpret a byte value above 127 as a flag telling it to perform some function.
It is often desirable, however, to be able to send non-textual data through text-based systems, such as when one might attach an image file to an email message. To accomplish this, the data is encoded in some way, such that eight-bit data is encoded into seven-bit ASCII characters (generally using only alphanumeric and punctuation characters, that is, the ASCII printable characters). Upon safe arrival at its destination, it is then decoded back to its eight-bit form. This process is referred to as binary-to-text encoding. Many programs, such as PGP and GNU Privacy Guard, perform this conversion to allow for data transport.
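As a minimal sketch of this encode, transmit, decode cycle, the following Python uses Base64 from the standard library. Base64 is only one possible choice here, and the variable names are illustrative.

```python
import base64

# Every possible eight-bit value, including those above 127.
binary = bytes(range(256))

# Encode into text that only uses printable seven-bit ASCII characters.
text = base64.b64encode(binary).decode("ascii")
assert all(32 < ord(c) < 127 for c in text)

# At the destination, decoding restores the original eight-bit data exactly.
restored = base64.b64decode(text)
assert restored == binary

print(len(binary), "bytes became", len(text), "characters")
```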
Encoding plain text
Binary-to-text encoding methods are also used as a mechanism for encoding plain text. For example:
* Some systems have a more limited character set they can handle; not only are they not 8-bit clean, some cannot even handle every printable ASCII character.
* Other systems have limits on the number of characters that may appear between line breaks, such as the "1000 characters per line" limit of some Simple Mail Transfer Protocol software, as allowed by the SMTP specification.
* Still others add headers or trailers to the text.
* A few poorly regarded but still-used protocols use in-band signaling, causing confusion if specific patterns appear in the message. The best-known is the string "From " (including trailing space) at the beginning of a line, used to separate mail messages in the mbox file format.
By using a binary-to-text encoding on messages that are already plain text, then decoding on the other end, one can make such systems appear to be completely transparent. This is sometimes referred to as "ASCII armoring". For example, the ViewState component of ASP.NET uses base64 encoding to safely transmit text via HTTP POST, in order to avoid delimiter collision.
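A rough sketch of armoring in Python follows. The helper names `armor` and `dearmor` and the 64-character line width are assumptions made for this illustration; they are not taken from PGP, ASP.NET, or any mail software.

```python
import base64

def armor(text: str, width: int = 64) -> str:
    """Base64-encode text and wrap it into short lines (hypothetical helper)."""
    encoded = base64.b64encode(text.encode("utf-8")).decode("ascii")
    return "\n".join(encoded[i:i + width] for i in range(0, len(encoded), width))

def dearmor(armored: str) -> str:
    """Undo armor(): drop the line breaks and decode (hypothetical helper)."""
    return base64.b64decode(armored.replace("\n", "")).decode("utf-8")

# The second line starts with "From ", which an mbox reader could misread.
message = "Hello,\nFrom here on, the text could confuse an mbox reader.\n"
armored = armor(message)

# The base64 alphabet has no space, so no armored line can start with "From ".
assert not any(line.startswith("From ") for line in armored.split("\n"))
assert dearmor(armored) == message
```

The cost is that the armored text is no longer human-readable and is about a third longer than the original.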
Encoding standards
The table below compares the most used forms of binary-to-text encodings. The efficiency listed is the ratio between the number of bits in the input and the number of bits in the encoded output.
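The efficiency figure defined above can be measured directly. The sketch below does so for hexadecimal and base64 using Python's standard `base64` module, assuming each output character occupies one eight-bit byte; the function name `efficiency` is illustrative.

```python
import base64

def efficiency(encode, data: bytes) -> float:
    """Ratio of input bits to encoded output bits (illustrative helper)."""
    encoded = encode(data)
    input_bits = len(data) * 8
    output_bits = len(encoded) * 8  # assumes one byte per output character
    return input_bits / output_bits

# 768 bytes: a multiple of 3, so base64 needs no "=" padding.
sample = bytes(range(256)) * 3

hex_efficiency = efficiency(base64.b16encode, sample)
base64_efficiency = efficiency(base64.b64encode, sample)
print(f"hexadecimal: {hex_efficiency:.0%}, base64: {base64_efficiency:.0%}")
```

For short inputs the measured base64 figure can be slightly lower, because padding characters add to the output without carrying data.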
The 95 isprint codes 32 to 126 are known as the ASCII printable characters.
Some older and now uncommon formats include BOO, BTOA, and USR encoding.
Most of these encodings generate text containing only a subset of all ASCII printable characters: for example, the base64 encoding generates text that contains only upper-case and lower-case letters (A–Z, a–z), numerals (0–9), and the "+", "/", and "=" symbols.
Some of these encodings (quoted-printable and percent-encoding) are based on a set of allowed characters and a single escape character. The allowed characters are left unchanged, while all other characters are converted into a string starting with the escape character. This kind of conversion allows the resulting text to be almost readable, in that letters and digits are part of the allowed characters, and are therefore left as they are in the encoded text. These encodings produce the shortest plain ASCII output for input that is mostly printable ASCII.
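Both escape-based schemes are available in Python's standard library, which makes for a quick illustration. The sample string is made up for this sketch; `%` is the escape character for percent-encoding and `=` for quoted-printable.

```python
import quopri
from urllib.parse import quote, unquote_to_bytes

# One non-ASCII byte (0xE9) plus characters that collide with the escape characters.
data = "café = 100% good".encode("latin-1")

# Percent-encoding: letters and digits pass through, everything else becomes %XX.
percent = quote(data, safe="")
print(percent)  # caf%E9%20%3D%20100%25%20good

# Quoted-printable: the non-ASCII byte and the "=" escape character are rewritten.
qp = quopri.encodestring(data)
print(qp)

# Both encodings are reversible.
assert unquote_to_bytes(percent) == data
assert quopri.decodestring(qp) == data
```

In both outputs the word "caf" and the digits remain legible, which is the "almost readable" property described above.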
Some other encodings (base64, uuencoding) are based on mapping all possible sequences of six bits into different printable characters. Since there are more than 2^6 = 64 printable characters, this is possible. A given sequence of bytes is translated by viewing it as a stream of bits, breaking this stream into chunks of six bits, and generating the sequence of corresponding characters. The encodings differ in the mapping between sequences of bits and characters and in how the resulting text is formatted.
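The six-bit chunking can be sketched in a few lines of Python. This deliberately slow, string-based version uses the standard base64 alphabet and "=" padding so that it can be compared against the library encoder; the function name `encode_six_bit` is illustrative.

```python
import base64

# The standard base64 alphabet: index 0 is "A", index 63 is "/".
ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"

def encode_six_bit(data: bytes) -> str:
    # View the bytes as one long stream of bits.
    bits = "".join(format(byte, "08b") for byte in data)
    # Pad with zero bits so the stream splits evenly into six-bit chunks.
    bits += "0" * (-len(bits) % 6)
    # Map each six-bit chunk to its character.
    chars = [ALPHABET[int(bits[i:i + 6], 2)] for i in range(0, len(bits), 6)]
    # Base64 pads the text with "=" to a multiple of four characters.
    chars.append("=" * (-len(chars) % 4))
    return "".join(chars)

print(encode_six_bit(b"Man"))  # TWFu
assert encode_six_bit(b"Man") == base64.b64encode(b"Man").decode("ascii")
```

Swapping `ALPHABET` for a different 64-character table and changing the padding and line formatting is, in essence, what distinguishes one six-bit encoding from another.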
Some encodings (the original version of BinHex and the recommended encoding for CipherSaber) use four bits instead of six, mapping all possible sequences of 4 bits onto the 16 standard hexadecimal digits. Using 4 bits per encoded character leads to a 50% longer output than base64, but simplifies encoding and decoding: expanding each byte in the source independently to two encoded bytes is simpler than base64's expansion of 3 source bytes to 4 encoded bytes.
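A sketch of the four-bit approach shows why it is simpler: each byte is handled on its own, with no state carried between bytes and no padding. The name `encode_hex` and the choice of upper-case digits are assumptions of this example.

```python
import base64

HEX_DIGITS = "0123456789ABCDEF"

def encode_hex(data: bytes) -> str:
    # Split every byte into its high and low four bits, independently of its neighbours.
    return "".join(HEX_DIGITS[b >> 4] + HEX_DIGITS[b & 0x0F] for b in data)

sample = b"Man"
hex_text = encode_hex(sample)                        # "4D616E"
b64_text = base64.b64encode(sample).decode("ascii")  # "TWFu"

# Six hexadecimal characters versus four base64 characters: 50% longer.
assert len(hex_text) == 1.5 * len(b64_text)
print(hex_text, b64_text)
```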
Out of PETSCII's first 192 codes, 164 have visible representations when quoted: 5 (white), 17–20 and 28–31 (colors and cursor controls), 32–90 (ASCII equivalent), 91–127 (graphics), 129 (orange), 133–140 (function keys), 144–159 (colors and cursor controls), and 160–192 (graphics). This theoretically permits encodings, such as base128, between PETSCII-speaking machines.
See also
* Alphanumeric shellcode
* Character encoding
* Compiling
* Computer number format
* Geocode
* Numeral systems, listed by notation type
* Punycode