A prefix code is a type of
code system distinguished by its possession of the "prefix property", which requires that there is no whole
code word
In communication, a code word is an element of a standardized code or Communications protocol, protocol. Each code word is assembled in accordance with the specific rules of the code and assigned a unique meaning. Code words are typically used for ...
in the system that is a
prefix
A prefix is an affix which is placed before the stem of a word. Adding it to the beginning of one word changes it into another word. For example, when the prefix ''un-'' is added to the word ''happy'', it creates the word ''unhappy''. Particu ...
(initial segment) of any other code word in the system. It is trivially true for fixed-length code, so only a point of consideration in
variable-length code
In coding theory a variable-length code is a code which maps source symbols to a ''variable'' number of bits.
Variable-length codes can allow sources to be compressed and decompressed with ''zero'' error (lossless data compression) and still b ...
.
For example, a code with code words has the prefix property; a code consisting of does not, because "5" is a prefix of "59" and also of "55". A prefix code is a
uniquely decodable code: given a complete and accurate sequence, a receiver can identify each word without requiring a special marker between words. However, there are uniquely decodable codes that are not prefix codes; for instance, the reverse of a prefix code is still uniquely decodable (it is a suffix code), but it is not necessarily a prefix code.
Prefix codes are also known as prefix-free codes, prefix condition codes and instantaneous codes. Although
Huffman coding
In computer science and information theory, a Huffman code is a particular type of optimal prefix code that is commonly used for lossless data compression. The process of finding or using such a code proceeds by means of Huffman coding, an algor ...
is just one of many algorithms for deriving prefix codes, prefix codes are also widely referred to as "Huffman codes", even when the code was not produced by a Huffman algorithm. The term comma-free code is sometimes also applied as a synonym for prefix-free codes but in most mathematical books and articles (e.g.) a comma-free code is used to mean a
self-synchronizing code
In coding theory, especially in telecommunications, a self-synchronizing code is a uniquely decodable code in which the symbol stream formed by a portion of one code word, or by the overlapped portion of any two adjacent code words, is not a va ...
, a subclass of prefix codes.
Using prefix codes, a message can be transmitted as a sequence of concatenated code words, without any
out-of-band markers or (alternatively) special markers between words to
frame
A frame is often a structural system that supports other components of a physical construction and/or steel frame that limits the construction's extent.
Frame and FRAME may also refer to:
Physical objects
In building construction
* Framing (co ...
the words in the message. The recipient can decode the message unambiguously, by repeatedly finding and removing sequences that form valid code words. This is not generally possible with codes that lack the prefix property, for example : a receiver reading a "1" at the start of a code word would not know whether that was the complete code word "1", or merely the prefix of the code word "10" or "11"; so the string "10" could be interpreted either as a single codeword or as the concatenation of the words "1" then "0".
The variable-length
Huffman codes,
country calling codes
Country calling codes or country dial-in codes are telephone number prefixes for reaching telephone subscribers in the networks of the member countries or regions of the International Telecommunication Union (ITU). The codes are defined by the ...
, the country and publisher parts of
ISBN
The International Standard Book Number (ISBN) is a numeric commercial book identifier that is intended to be unique. Publishers purchase ISBNs from an affiliate of the International ISBN Agency.
An ISBN is assigned to each separate edition an ...
s, the Secondary Synchronization Codes used in the
UMTS
The Universal Mobile Telecommunications System (UMTS) is a third generation mobile cellular system for networks based on the GSM standard. Developed and maintained by the 3GPP (3rd Generation Partnership Project), UMTS is a component of the In ...
W-CDMA
The Universal Mobile Telecommunications System (UMTS) is a third generation mobile cellular system for networks based on the GSM standard. Developed and maintained by the 3GPP (3rd Generation Partnership Project), UMTS is a component of the I ...
3G Wireless Standard, and the
instruction set
In computer science, an instruction set architecture (ISA), also called computer architecture, is an abstract model of a computer. A device that executes instructions described by that ISA, such as a central processing unit (CPU), is called a ...
s (machine language) of most computer microarchitectures are prefix codes.
Prefix codes are not
error-correcting codes
In computing, telecommunication, information theory, and coding theory, an error correction code, sometimes error correcting code, (ECC) is used for controlling errors in data over unreliable or noisy communication channels. The central idea is ...
. In practice, a message might first be compressed with a prefix code, and then encoded again with
channel coding (including error correction) before transmission.
For any
uniquely decodable code there is a prefix code that has the same code word lengths.
[Le Boudec, Jean-Yves, Patrick Thiran, and Rüdiger Urbanke. Introduction aux sciences de l'information: entropie, compression, chiffrement et correction d'erreurs. PPUR Presses polytechniques, 2015.] Kraft's inequality characterizes the sets of code word lengths that are possible in a
uniquely decodable code.
[Berstel et al (2010) p.75]
Techniques
If every word in the code has the same length, the code is called a fixed-length code, or a block code (though the term
block code
In coding theory, block codes are a large and important family of error-correcting codes that encode data in blocks.
There is a vast number of examples for block codes, many of which have a wide range of practical applications. The abstract defini ...
is also used for fixed-size
error-correcting code
In computing, telecommunication, information theory, and coding theory, an error correction code, sometimes error correcting code, (ECC) is used for controlling errors in data over unreliable or noisy communication channels. The central idea i ...
s in
channel coding). For example,
ISO 8859-15
ISO/IEC 8859-15:1999, ''Information technology — 8-bit single-byte coded graphic character sets — Part 15: Latin alphabet No. 9'', is part of the ISO/IEC 8859 series of ASCII-based standard character encodings, first edition published in 1999. ...
letters are always 8 bits long.
UTF-32/UCS-4
UTF-32 (32-bit Unicode Transformation Format) is a fixed-length encoding used to encode Unicode code points that uses exactly 32 bits (four bytes) per code point (but a number of leading bits must be zero as there are far fewer than 232 Unicode cod ...
letters are always 32 bits long.
ATM cells are always 424 bits (53 bytes) long. A fixed-length code of fixed length ''k'' bits can encode up to
source symbols.
A fixed-length code is necessarily a prefix code. It is possible to turn any code into a fixed-length code by padding fixed symbols to the shorter prefixes in order to meet the length of the longest prefixes. Alternately, such padding codes may be employed to introduce redundancy that allows autocorrection and/or synchronisation. However, fixed length encodings are inefficient in situations where some words are much more likely to be transmitted than others.
Truncated binary encoding
Truncated binary encoding is an entropy encoding typically used for uniform probability distributions with a finite alphabet. It is parameterized by an alphabet with total size of number ''n''. It is a slightly more general form of binary encodin ...
is a straightforward generalization of fixed-length codes to deal with cases where the number of symbols ''n'' is not a power of two. Source symbols are assigned codewords of length ''k'' and ''k''+1, where ''k'' is chosen so that ''2
k < n ≤ 2
k+1''.
Huffman coding
In computer science and information theory, a Huffman code is a particular type of optimal prefix code that is commonly used for lossless data compression. The process of finding or using such a code proceeds by means of Huffman coding, an algor ...
is a more sophisticated technique for constructing variable-length prefix codes. The Huffman coding algorithm takes as input the frequencies that the code words should have, and constructs a prefix code that minimizes the weighted average of the code word lengths. (This is closely related to minimizing the entropy.) This is a form of
lossless data compression
Lossless compression is a class of data compression that allows the original data to be perfectly reconstructed from the compressed data with no loss of information. Lossless compression is possible because most real-world data exhibits statist ...
based on
entropy encoding
In information theory, an entropy coding (or entropy encoding) is any lossless data compression method that attempts to approach the lower bound declared by Shannon's source coding theorem, which states that any lossless data compression method ...
.
Some codes mark the end of a code word with a special "comma" symbol, different from normal data. This is somewhat analogous to the spaces between words in a sentence; they mark where one word ends and another begins. If every code word ends in a comma, and the comma does not appear elsewhere in a code word, the code is automatically prefix-free. However, modern communication systems send everything as sequences of "1" and "0" – adding a third symbol would be expensive, and using it only at the ends of words would be inefficient.
Morse code
Morse code is a method used in telecommunication to encode text characters as standardized sequences of two different signal durations, called ''dots'' and ''dashes'', or ''dits'' and ''dahs''. Morse code is named after Samuel Morse, one ...
is an everyday example of a variable-length code with a comma. The long pauses between letters, and the even longer pauses between words, help people recognize where one letter (or word) ends, and the next begins. Similarly,
Fibonacci coding
In mathematics and computing, Fibonacci coding is a universal code which encodes positive integers into binary code words. It is one example of representations of integers based on Fibonacci numbers. Each code word ends with "11" and contains n ...
uses a "11" to mark the end of every code word.
Self-synchronizing code
In coding theory, especially in telecommunications, a self-synchronizing code is a uniquely decodable code in which the symbol stream formed by a portion of one code word, or by the overlapped portion of any two adjacent code words, is not a va ...
s are prefix codes that allow
frame synchronization
In telecommunication, frame synchronization or framing is the process by which, while receiving a stream of framed data, incoming frame alignment signals (i.e., a distinctive bit sequences or syncwords) are identified (that is, distinguished fro ...
.
Related concepts
A suffix code is a set of words none of which is a suffix of any other; equivalently, a set of words which are the reverse of a prefix code. As with a prefix code, the representation of a string as a concatenation of such words is unique. A bifix code is a set of words which is both a prefix and a suffix code.
[Berstel et al (2010) p.58]
An optimal prefix code is a prefix code with minimal average length. That is, assume an alphabet of symbols with probabilities
for a prefix code . If is another prefix code and
are the lengths of the codewords of , then
.
Prefix codes in use today
Examples of prefix codes include:
* variable-length
Huffman codes
*
country calling codes
Country calling codes or country dial-in codes are telephone number prefixes for reaching telephone subscribers in the networks of the member countries or regions of the International Telecommunication Union (ITU). The codes are defined by the ...
*
Chen–Ho encoding
Chen–Ho encoding is a memory-efficient alternate system of binary encoding for decimal digits.
The traditional system of binary encoding for decimal digits, known as binary-coded decimal (BCD), uses four bits to encode each digit, resulting ...
* the country and publisher parts of
ISBN
The International Standard Book Number (ISBN) is a numeric commercial book identifier that is intended to be unique. Publishers purchase ISBNs from an affiliate of the International ISBN Agency.
An ISBN is assigned to each separate edition an ...
s
* the Secondary Synchronization Codes used in the
UMTS
The Universal Mobile Telecommunications System (UMTS) is a third generation mobile cellular system for networks based on the GSM standard. Developed and maintained by the 3GPP (3rd Generation Partnership Project), UMTS is a component of the In ...
W-CDMA
The Universal Mobile Telecommunications System (UMTS) is a third generation mobile cellular system for networks based on the GSM standard. Developed and maintained by the 3GPP (3rd Generation Partnership Project), UMTS is a component of the I ...
3G Wireless Standard
*
VCR Plus+ codes
*
Unicode Transformation Format
Unicode, formally The Unicode Standard,The formal version reference is is an information technology standard for the consistent encoding, representation, and handling of text expressed in most of the world's writing systems. The standard, whi ...
, in particular the
UTF-8
UTF-8 is a variable-length character encoding used for electronic communication. Defined by the Unicode Standard, the name is derived from ''Unicode'' (or ''Universal Coded Character Set'') ''Transformation Format 8-bit''.
UTF-8 is capable of ...
system for encoding
Unicode
Unicode, formally The Unicode Standard,The formal version reference is is an information technology standard for the consistent encoding, representation, and handling of text expressed in most of the world's writing systems. The standard, ...
characters, which is both a prefix-free code and a
self-synchronizing code
In coding theory, especially in telecommunications, a self-synchronizing code is a uniquely decodable code in which the symbol stream formed by a portion of one code word, or by the overlapped portion of any two adjacent code words, is not a va ...
*
variable-length quantity
A variable-length quantity (VLQ) is a universal code that uses an arbitrary number of binary octets (eight-bit bytes) to represent an arbitrarily large integer. A VLQ is essentially a base-128 representation of an unsigned integer with the additi ...
Techniques
Commonly used techniques for constructing prefix codes include
Huffman codes and the earlier
Shannon–Fano codes, and
universal codes such as:
*
Elias delta coding
Elias δ code or Elias delta code is a universal code encoding the positive integers developed by Peter Elias.
Encoding
To code a number ''X'' ≥ 1:
# Let ''N'' = ⌊log2 ''X''⌋; be the highest power of 2 in ''X'', so 2''N'' ≤ ' ...
*
Elias gamma coding
Elias γ code or Elias gamma code is a universal code encoding positive integers developed by Peter Elias. It is used most commonly when coding integers whose upper-bound cannot be determined beforehand.
Encoding
To code a number ''x'' � ...
*
Elias omega coding
*
Fibonacci coding
In mathematics and computing, Fibonacci coding is a universal code which encodes positive integers into binary code words. It is one example of representations of integers based on Fibonacci numbers. Each code word ends with "11" and contains n ...
*
Levenshtein coding
*
Unary coding
*
Golomb Rice code
Golomb coding is a lossless data compression method using a family of data compression codes invented by Solomon W. Golomb in the 1960s. Alphabets following a geometric distribution will have a Golomb code as an optimal prefix code, making Gol ...
*
Straddling checkerboard
A straddling checkerboard is a device for converting an alphanumeric plaintext into digits whilst simultaneously achieving fractionation (a simple form of information diffusion) and data compression relative to other schemes using digits. It als ...
(simple cryptography technique which produces prefix codes)
* binary coding
Notes
References
*
*
* D.A. Huffman, "A method for the construction of minimum-redundancy codes", Proceedings of the I.R.E., Sept. 1952, pp. 1098–1102 (Huffman's original article)
Profile: David A. Huffman Scientific American
''Scientific American'', informally abbreviated ''SciAm'' or sometimes ''SA'', is an American popular science magazine. Many famous scientists, including Albert Einstein and Nikola Tesla, have contributed articles to it. In print since 1845, it i ...
, Sept. 1991, pp. 54–58 (Background story)
*
Thomas H. Cormen,
Charles E. Leiserson,
Ronald L. Rivest, and
Clifford Stein
Clifford Seth Stein (born December 14, 1965), a computer scientist, is a professor of industrial engineering and operations research at Columbia University in New York, NY, where he also holds an appointment in the Department of Computer Scien ...
. ''
Introduction to Algorithms
''Introduction to Algorithms'' is a book on computer programming by Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein. The book has been widely used as the textbook for algorithms courses at many universities and is ...
'', Second Edition. MIT Press and McGraw-Hill, 2001. . Section 16.3, pp. 385–392.
* {{FS1037C
External links
Codes, trees and the prefix propertyby Kona Macphee
Coding theory
Prefixes
Data compression
Lossless compression algorithms