Base32 is an encoding method based on the base-32 numeral system. It uses an alphabet of 32 digits, each of which represents a different combination of 5 bits (2^5 = 32). Since base32 is not very widely adopted, the question of notation (which characters to use to represent the 32 digits) is not as settled as it is for better-known numeral systems such as hexadecimal, though RFCs and unofficial and de facto standards exist. One way to represent Base32 numbers in human-readable form is to use the digits 0–9 followed by the twenty-two upper-case letters A–V. However, many other variations are used in different contexts. Historically, Baudot code could be considered a modified (stateful) base32 code. Base32 is often used to represent byte strings.
RFC 4648 encodings
The October 2006 proposed Internet standard RFC 4648 documents base16, base32 and base64 encodings. It includes two schemes for base32, but recommends one over the other. It further recommends that, regardless of precedent, only the alphabet it defines in its section 6 actually be called base32, and that the other similar alphabet in its section 7 instead be called base32hex. Agreement with those recommendations is not universal. Care needs to be taken when using systems that are called base32: they could be base32 per RFC 4648 §6, or per §7 (possibly disregarding that RFC's deprecation of the simpler name for the latter), or they could be yet another encoding variant (see further below).
Base 32 Encoding per §6
The most widely used base32 alphabet is defined in RFC 4648 §6 and in its 2003 predecessor. The scheme was originally designed in 2000 by John Myers for SASL/GSSAPI. It uses an alphabet of A–Z, followed by 2–7. The digits 0, 1 and 8 are skipped due to their similarity with the letters O, I and B (thus "2" has a decimal value of 26).
In some circumstances padding is not required or used (the padding can be inferred from the length of the string modulo 8). RFC 4648 states that padding must be used unless the specification of the standard (referring to the RFC) explicitly states otherwise. Excluding padding is useful when using Base32 encoded data in URL tokens or file names where the padding character could pose a problem.
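The padding behaviour described above can be sketched with Python's standard `base64` module. This is a minimal illustration, not a reference implementation; the helper names `encode_unpadded` and `decode_unpadded` are hypothetical and chosen only for this example.

```python
import base64


def encode_unpadded(data: bytes) -> str:
    """Encode with the RFC 4648 section 6 alphabet, then drop trailing '='."""
    return base64.b32encode(data).decode("ascii").rstrip("=")


def decode_unpadded(text: str) -> bytes:
    """Restore the padding implied by the length modulo 8, then decode."""
    padding = "=" * (-len(text) % 8)
    return base64.b32decode(text + padding)


padded = base64.b32encode(b"foobar").decode("ascii")
unpadded = encode_unpadded(b"foobar")
print(padded)    # MZXW6YTBOI======
print(unpadded)  # MZXW6YTBOI
print(decode_unpadded(unpadded))  # b'foobar'
```

Because the amount of padding depends only on the string length modulo 8, the decoder can reconstruct it without any extra information.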
An IPFS CIDv1 in Base32 upper-case encoding is one example of a Base32 representation that uses the previously described 32-character set.
Base 32 Encoding with Extended Hex Alphabet per §7
"Extended hex" base 32, or base32hex, is another scheme for base 32, defined in RFC 4648 §7. It extends hexadecimal in a more natural way: its lower half is identical with hexadecimal, and beyond that, base32hex simply continues the alphabet through to the letter V.
This scheme was first proposed by Christian Lanctot, a programmer working at Sage software, in a letter to Dr. Dobb's magazine in March 1999 as part of a suggested solution for the Y2K bug. Lanctot referred to it as "Double Hex". The same alphabet was described in 2000 under the name "Base-32". RFC 4648, while acknowledging existing use of this version in NSEC3, refers to it as base32hex and discourages referring to it as only "base32".
Since this notation uses digits 0–9 followed by consecutive letters of the alphabet, it matches the digits used by the JavaScript parseInt() function and the Python int() constructor when a base larger than 10 (such as 16 or 32) is specified. It also retains hexadecimal's property of preserving bitwise sort order of the represented data, unlike RFC 4648's §6 base32, or base64.
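Both properties can be checked with a short Python sketch. The helper `b32hex` is a hypothetical name used only here; it derives base32hex from the standard §6 encoder by translating the alphabet, and the sort-order comparison is limited to equal-length inputs to keep padding out of the picture.

```python
import base64

STD_ALPHABET = b"ABCDEFGHIJKLMNOPQRSTUVWXYZ234567"  # RFC 4648 section 6
HEX_ALPHABET = b"0123456789ABCDEFGHIJKLMNOPQRSTUV"  # RFC 4648 section 7
TO_HEX = bytes.maketrans(STD_ALPHABET, HEX_ALPHABET)


def b32hex(data: bytes) -> str:
    """Encode with the section 6 encoder, then swap in the section 7 alphabet."""
    return base64.b32encode(data).translate(TO_HEX).decode("ascii")


# base32hex digits are exactly what int() accepts for base 32.
print(int("V", 32))   # 31
print(int("VV", 32))  # 1023

# Equal-length byte strings keep their order under base32hex ...
samples = [b"\xff\x00", b"\x00\x01", b"\x80\x80", b"\x00\x00"]
print(sorted(samples) == sorted(samples, key=b32hex))            # True
# ... but not under the section 6 alphabet, where '2'-'7' sort before 'A'-'Z'.
print(sorted(samples) == sorted(samples, key=base64.b32encode))  # False
```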
Unlike many other base 32 notation systems, base32hex digits beyond 9 are contiguous. However, its set of digits includes characters that may visually conflict. With the right font it is possible to visually distinguish between 0, O and 1, I, but other fonts may be unsuitable, as those letters could be hard for humans to tell apart, especially when the context English usually provides is not present in a notation system that is only expressing numbers. The choice of font is not controlled by notation or encoding, yet base32hex makes no attempt to compensate for the shortcomings of affected fonts.
Alternative encoding schemes
The alternative schemes below change the Base32 alphabet, but all of them use similar combinations of alphanumeric symbols.
z-base-32
z-base-32 is a Base32 encoding designed by Zooko Wilcox-O'Hearn to be easier for human use and more compact. It includes 1, 8 and 9 but excludes l, v, 0 and 2. It also permutes the alphabet so that the easier characters are the ones that occur more frequently. It compactly encodes bitstrings whose length in bits is not a multiple of 8 and omits trailing padding characters. z-base-32 was used in the Mnet open source project, and is currently used in Phil Zimmermann's ZRTP protocol, and in the Tahoe-LAFS open source project.
Crockford's Base32
Another alternative design for Base32 was created by Douglas Crockford, who proposes using additional characters for a mod-37 checksum. It excludes the letters I, L, and O to avoid confusion with digits. It also excludes the letter U to reduce the likelihood of accidental obscenity.
Libraries to encode binary data in Crockford's Base32 are available in a variety of languages.
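The following Python sketch shows how such an alphabet might be used to encode and decode integers. The 32-symbol alphabet follows directly from the exclusions listed above. The five extra check symbols (`*~$=U`) and the lenient decoding of I, L and O as digits are taken from Crockford's published description rather than from this article, so treat them as assumptions. The function names are hypothetical.

```python
# Digits 0-9 plus the upper-case letters without I, L, O and U.
ALPHABET = "0123456789ABCDEFGHJKMNPQRSTVWXYZ"
# Assumption: five extra symbols extend the set to 37 for the check digit.
CHECK_SYMBOLS = ALPHABET + "*~$=U"
# Assumption: look-alike letters are accepted on input and read as digits.
LOOKALIKES = str.maketrans("ILO", "110")


def crockford_encode(number: int, checksum: bool = False) -> str:
    """Encode a non-negative integer, optionally appending a mod-37 check symbol."""
    if number < 0:
        raise ValueError("only non-negative integers are supported")
    digits = []
    remaining = number
    while True:
        remaining, value = divmod(remaining, 32)
        digits.append(ALPHABET[value])
        if remaining == 0:
            break
    text = "".join(reversed(digits))
    if checksum:
        text += CHECK_SYMBOLS[number % 37]
    return text


def crockford_decode(text: str) -> int:
    """Decode a string without check symbol, ignoring case and look-alikes."""
    number = 0
    for symbol in text.upper().translate(LOOKALIKES):
        number = number * 32 + ALPHABET.index(symbol)
    return number


print(crockford_encode(1234))                 # 16J
print(crockford_encode(1234, checksum=True))  # 16JD
print(crockford_decode("i6j"))                # 1234
```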
Electrologica
An earlier form of base 32 notation was used by programmers working on the Electrologica X1 to represent machine addresses. The "digits" were represented as decimal numbers from 0 to 31. For example, 12-16 would represent the machine address 400 (= 12 × 32 + 16).
Geohash
See the Geohash algorithm, used to represent latitude and longitude values in one (bit-interlaced) positive integer. The base32 representation of Geohash uses all decimal digits (0–9) and almost all of the lower-case alphabet, except the letters "a", "i", "l" and "o".
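The character set described above can be rebuilt from that rule in a few lines of Python. This sketch assumes the symbols are assigned the values 0 to 31 in ascending order (digits first, then letters). The helper `geohash_to_int` is a hypothetical name, and it only converts an existing hash string to its integer; it does not implement the latitude/longitude interleaving.

```python
import string

# All decimal digits, then the lower-case letters except a, i, l and o.
GEOHASH_ALPHABET = string.digits + "".join(
    letter for letter in string.ascii_lowercase if letter not in "ailo"
)
# Assumption: symbols take the values 0..31 in this order.
CHAR_TO_VALUE = {symbol: value for value, symbol in enumerate(GEOHASH_ALPHABET)}


def geohash_to_int(geohash: str) -> int:
    """Concatenate the 5-bit values of each symbol into one integer."""
    number = 0
    for symbol in geohash:
        number = (number << 5) | CHAR_TO_VALUE[symbol]
    return number


print(GEOHASH_ALPHABET)            # 0123456789bcdefghjkmnpqrstuvwxyz
print(len(GEOHASH_ALPHABET))       # 32
print(bin(geohash_to_int("ezs42")))
```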
Turing's encoding
In approximately 1950, Alan Turing wrote software requirements for the Manchester Mark I computing system.
A transcription of Turing's manual for the Mark I is available on archive.org.
The University of Manchester's archive site commemorating 60 years of computing has a table of the base 32 encoding that Turing used. The table and the accompanying explanation also appear in the manual.
Another account of this period in Turing's life appears on his biography page under "Early computers and the Turing test".
Video games
Before NVRAM became universal, several video games for Nintendo platforms used base 31 numbers for passwords.
These systems omit vowels (except Y) to prevent the game from accidentally giving a profane password.
Thus, the characters are generally some minor variation of the following set: 0–9, B, C, D, F, G, H, J, K, L, M, N, P, Q, R, S, T, V, W, X, Y, Z, and some punctuation marks.
Games known to use such a system include Mario Is Missing!, Mario's Time Machine, Tetris Blast, and The Lord of the Rings (Super NES).
Word-safe alphabet
The word-safe Base32 alphabet is an extension of the Open Location Code Base20 alphabet. That alphabet uses 8 numeric digits and 12 case-sensitive letter digits chosen to avoid accidentally forming words. Treating the alphabet as case-sensitive produces a 32 (8+12+12) digit set.
Comparisons with other systems
Advantages
Base32 has a number of advantages over Base64:
# The resulting character set is all one case, which can often be beneficial when using a case-insensitive filesystem, DNS names, spoken language, or human memory.
# The result can be used as a file name because it cannot possibly contain the '/' symbol, which is the Unix path separator.
# The alphabet can be selected to avoid similar-looking pairs of different symbols, so the strings can be accurately transcribed by hand. (For example, the RFC 4648 §6 symbol set omits the digits for one, eight and zero, since they could be confused with the letters 'I', 'B', and 'O'.)
# A result excluding padding can be included in a URL without encoding any characters.
Base32 has advantages over hexadecimal/Base16:
# Base32 representation takes 20% less space. (1000 bits take 200 characters, compared with 250 for Base16.)
Compared with 8-bit-based encodings, 5-bit systems might also have advantages when used for character transmission:
# Because it features the complete alphabet, the RFC 4648 §6 Base32 scheme (and similar ones) allows two more characters to be encoded per 32-bit integer (for a total of 6 instead of 4, with 2 bits to spare), saving bandwidth in constrained domains such as radio meshes.
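The packing arithmetic in the last point can be illustrated with a small Python sketch. It treats each symbol of the §6 alphabet as a 5-bit character code and packs six of them into a single integer that fits in 32 bits. The functions `pack6` and `unpack6` are hypothetical helpers written for this example, not part of any standard.

```python
ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ234567"  # RFC 4648 section 6


def pack6(text: str) -> int:
    """Pack exactly six alphabet symbols into the low 30 bits of an integer."""
    if len(text) != 6:
        raise ValueError("expected exactly six characters")
    word = 0
    for symbol in text:
        word = (word << 5) | ALPHABET.index(symbol)
    return word


def unpack6(word: int) -> str:
    """Recover the six symbols, most significant 5-bit group first."""
    return "".join(ALPHABET[(word >> shift) & 31] for shift in range(25, -1, -5))


word = pack6("HELLOA")
print(word < 2**32)   # True: 6 x 5 = 30 bits, leaving 2 bits spare
print(unpack6(word))  # HELLOA
```

An 8-bit character code would fit only four characters into the same 32-bit word.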
Disadvantages
Base32 representation takes roughly 20% more space than Base64. Also, because it encodes five 8-bit bytes (40 bits) to eight 5-bit base32 characters rather than three 8-bit bytes (24 bits) to four 6-bit base64 characters, padding to an 8-character boundary is a greater burden on short messages (which may be a reason to elide padding, where the governing specification permits it).
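These size relationships can be checked with Python's standard `base64` module. The helper `encoded_lengths` is a hypothetical name. A 15-byte input is used for the first comparison because 15 is a multiple of both 5 and 3, so neither Base32 nor Base64 needs padding.

```python
import base64


def encoded_lengths(data: bytes) -> dict:
    """Return the encoded length of `data` in each of the three encodings."""
    return {
        "base16": len(base64.b16encode(data)),
        "base32": len(base64.b32encode(data)),
        "base64": len(base64.b64encode(data)),
    }


# 15 bytes: Base32 is 20% longer than Base64 and 20% shorter than Base16.
print(encoded_lengths(bytes(15)))  # {'base16': 30, 'base32': 24, 'base64': 20}

# 1 byte: padding dominates, so Base32 needs 8 characters against 4 and 2.
print(encoded_lengths(b"x"))       # {'base16': 2, 'base32': 8, 'base64': 4}
```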
Although Base32 takes roughly 20% less space than hexadecimal, it is much less widely used. Hexadecimal maps easily to bytes because two hexadecimal digits make up one byte. Base32 does not map to individual bytes. However, two Base32 digits correspond to ten bits, which can encode 32 × 32 = 1,024 values, with obvious applications for orders of magnitude of multiple-byte units in terms of powers of 1,024.
Hexadecimal is easier to learn and remember, since that only entails memorising the numerical values of six additional symbols (A–F), and even if those are not instantly recalled, it is easier to count through just over a handful of values.
Software implementations
Base32 programs are suitable for encoding arbitrary byte data using a restricted set of symbols that can both be conveniently used by humans and processed by computers.
Base32 implementations use a symbol set made up of at least 32 different characters (sometimes a 33rd for padding), as well as an algorithm for encoding arbitrary sequences of 8-bit bytes into a Base32 alphabet. Because more than one 5-bit Base32 character is needed to represent each 8-bit input byte, if the input is not a multiple of 5 bytes (40 bits), then it doesn't fit exactly in 5-bit Base32 characters. In that case, some specifications require padding characters to be added while some require extra zero bits to make a multiple of 5 bits. The closely related Base64 system, in contrast, uses a set of 64 symbols (or 65 symbols when padding is used).
Base32 implementations in C/C++, Perl, Java, JavaScript, Python, Go and Ruby are available.