A prefix code is a type of

code In communications and information processing, code is a system of rules to convert information—such as a letter, word, sound, image, or gesture—into another form, sometimes shortened or secret, for communication through a communicati ...

system distinguished by its possession of the prefix property, which requires that there is no whole code word in the system that is a

prefix A prefix is an affix which is placed before the stem of a word. Particularly in the study of languages, a prefix is also called a preformative, because it alters the form of the word to which it is affixed. Prefixes, like other affixes, can b ...

(initial segment) of any other code word in the system. It is trivially true for fixed-length codes, so only a point of consideration for variable-length codes. For example, a code with code has the prefix property; a code consisting of does not, because "5" is a prefix of "59" and also of "55". A prefix code is a uniquely decodable code: given a complete and accurate sequence, a receiver can identify each word without requiring a special marker between words. However, there are uniquely decodable codes that are not prefix codes; for instance, the reverse of a prefix code is still uniquely decodable (it is a suffix code), but it is not necessarily a prefix code. Prefix codes are also known as prefix-free codes, prefix condition codes and instantaneous codes. Although

Huffman coding In computer science and information theory, a Huffman code is a particular type of optimal prefix code that is commonly used for lossless data compression. The process of finding or using such a code is Huffman coding, an algorithm developed by ...

is just one of many algorithms for deriving prefix codes, prefix codes are also widely referred to as "Huffman codes", even when the code was not produced by a Huffman algorithm. The term comma-free code is sometimes also applied as a synonym for prefix-free codes but in most mathematical books and articles (e.g.) a comma-free code is used to mean a

self-synchronizing code In coding theory, especially in telecommunications, a self-synchronizing code is a uniquely decodable code in which the symbol stream formed by a portion of one code word, or by the overlapped portion of any two adjacent code words, is not a ...

, a subclass of prefix codes. Using prefix codes, a message can be transmitted as a sequence of concatenated code words, without any out-of-band markers or (alternatively) special markers between words to frame the words in the message. The recipient can decode the message unambiguously, by repeatedly finding and removing sequences that form valid code words. This is not generally possible with codes that lack the prefix property, for example : a receiver reading a "1" at the start of a code word would not know whether that was the complete code word "1", or merely the prefix of the code word "10" or "11"; so the string "10" could be interpreted either as a single codeword or as the concatenation of the words "1" then "0". The variable-length Huffman codes, country calling codes, the country and publisher parts of

ISBN The International Standard Book Number (ISBN) is a numeric commercial book identifier that is intended to be unique. Publishers purchase or receive ISBNs from an affiliate of the International ISBN Agency. A different ISBN is assigned to e ...

s, the Secondary Synchronization Codes used in the

UMTS The Universal Mobile Telecommunications System (UMTS) is a 3G mobile cellular system for networks based on the GSM standard. UMTS uses Wideband Code Division Multiple Access, wideband code-division multiple access (W-CDMA) radio access technolog ...

W-CDMA The Universal Mobile Telecommunications System (UMTS) is a 3G mobile cellular system for networks based on the GSM standard. UMTS uses wideband code-division multiple access (W-CDMA) radio access technology to offer greater spectral efficiency ...

3G Wireless Standard, and the

instruction set In computer science, an instruction set architecture (ISA) is an abstract model that generally defines how software controls the CPU in a computer or a family of computers. A device or program that executes instructions described by that ISA, s ...

s (machine language) of most computer microarchitectures are prefix codes. Prefix codes are not error-correcting codes. In practice, a message might first be compressed with a prefix code, and then encoded again with

channel coding In computing, telecommunication, information theory, and coding theory, forward error correction (FEC) or channel coding is a technique used for error control, controlling errors in data transmission over unreliable or noisy communication channel ...

(including error correction) before transmission. For any uniquely decodable code there is a prefix code that has the same code word lengths.Le Boudec, Jean-Yves, Patrick Thiran, and Rüdiger Urbanke. Introduction aux sciences de l'information: entropie, compression, chiffrement et correction d'erreurs. PPUR Presses polytechniques, 2015. Kraft's inequality characterizes the sets of code word lengths that are possible in a uniquely decodable code.Berstel et al (2010) p.75

Techniques

If every word in the code has the same length, the code is called a fixed-length code, or a block code (though the term

block code In coding theory, block codes are a large and important family of Channel coding, error-correcting codes that encode data in blocks. There is a vast number of examples for block codes, many of which have a wide range of practical applications. Th ...

is also used for fixed-size

error-correcting code In computing, telecommunication, information theory, and coding theory, forward error correction (FEC) or channel coding is a technique used for controlling errors in data transmission over unreliable or noisy communication channels. The centra ...

s in

). For example,

ISO 8859-15 ISO/IEC 8859-15:1999, ''Information technology — 8-bit single-byte coded graphic character sets — Part 15: Latin alphabet No. 9'', is part of the ISO/IEC 8859 series of ASCII-based standard character encodings, first edition published in 1999. ...

letters are always 8 bits long.

UTF-32/UCS-4 UTF-32 (32- bit Unicode Transformation Format), sometimes called UCS-4, is a fixed-length encoding used to encode Unicode code points that uses exactly 32 bits (four bytes) per code point (but a number of leading bits must be zero as there are far ...

letters are always 32 bits long. ATM cells are always 424 bits (53 bytes) long. A fixed-length code of fixed length ''k'' bits can encode up to

2^

source symbols. A fixed-length code is necessarily a prefix code. It is possible to turn any code into a fixed-length code by padding fixed symbols to the shorter prefixes in order to meet the length of the longest prefixes. Alternately, such padding codes may be employed to introduce redundancy that allows autocorrection and/or synchronisation. However, fixed length encodings are inefficient in situations where some words are much more likely to be transmitted than others.

Truncated binary encoding Truncated binary encoding is an entropy encoding typically used for uniform probability distributions with a finite alphabet. It is parameterized by an alphabet with total size of number ''n''. It is a slightly more general form of binary encodin ...

is a straightforward generalization of fixed-length codes to deal with cases where the number of symbols ''n'' is not a power of two. Source symbols are assigned codewords of length ''k'' and ''k''+1, where ''k'' is chosen so that ''2^k < n ≤ 2^k+1''.

is a more sophisticated technique for constructing variable-length prefix codes. The Huffman coding algorithm takes as input the frequencies that the code words should have, and constructs a prefix code that minimizes the weighted average of the code word lengths. (This is closely related to minimizing the entropy.) This is a form of

lossless data compression Lossless compression is a class of data compression that allows the original data to be perfectly reconstructed from the compressed data with no loss of information. Lossless compression is possible because most real-world data exhibits Redundanc ...

based on

entropy encoding In information theory, an entropy coding (or entropy encoding) is any lossless data compression method that attempts to approach the lower bound declared by Shannon's source coding theorem, which states that any lossless data compression method ...

. Some codes mark the end of a code word with a special "comma" symbol (also called a

Sentinel value In computer programming, a sentinel value (also referred to as a flag value, trip value, rogue value, signal value, or dummy data) is a special value in the context of an algorithm which uses its presence as a condition of termination, typically ...

), different from normal data. This is somewhat analogous to the spaces between words in a sentence; they mark where one word ends and another begins. If every code word ends in a comma, and the comma does not appear elsewhere in a code word, the code is automatically prefix-free. However, reserving an entire symbol only for use as a comma can be inefficient, especially for languages with a small number of symbols.

Morse code Morse code is a telecommunications method which Character encoding, encodes Written language, text characters as standardized sequences of two different signal durations, called ''dots'' and ''dashes'', or ''dits'' and ''dahs''. Morse code i ...

is an everyday example of a variable-length code with a comma. The long pauses between letters, and the even longer pauses between words, help people recognize where one letter (or word) ends, and the next begins. Similarly,

Fibonacci coding In mathematics and computing, Fibonacci coding is a universal code which encodes positive integers into binary code words. It is one example of representations of integers based on Fibonacci numbers. Each code word ends with "11" and contains n ...

uses a "11" to mark the end of every code word.

Self-synchronizing code In coding theory, especially in telecommunications, a self-synchronizing code is a uniquely decodable code in which the symbol stream formed by a portion of one code word, or by the overlapped portion of any two adjacent code words, is not a ...

s are prefix codes that allow

frame synchronization In telecommunications, frame synchronization or framing is the process by which, while receiving a stream of fixed-length frames, the receiver identifies the frame boundaries, permitting the data bits within the frame to be extracted for decodin ...

Related concepts

A suffix code is a set of words none of which is a suffix of any other; equivalently, a set of words which are the reverse of a prefix code. As with a prefix code, the representation of a string as a concatenation of such words is unique. A bifix code is a set of words which is both a prefix and a suffix code.Berstel et al (2010) p.58 An optimal prefix code is a prefix code with minimal average length. That is, assume an alphabet of symbols with probabilities

p(A_i)

for a prefix code . If is another prefix code and

\lambda'_i

are the lengths of the codewords of , then

\sum_^n  \leq \sum_^n  \!

Prefix codes in use today

Examples of prefix codes include: * variable-length Huffman codes * country calling codes * Chen–Ho encoding * the country and publisher parts of

s * the Secondary Synchronization Codes used in the

3G Wireless Standard * VCR Plus+ codes *

Unicode Transformation Format Unicode or ''The Unicode Standard'' or TUS is a character encoding standard maintained by the Unicode Consortium designed to support the use of text in all of the world's writing systems that can be digitized. Version 16.0 defines 154,998 ch ...

, in particular the

UTF-8 UTF-8 is a character encoding standard used for electronic communication. Defined by the Unicode Standard, the name is derived from ''Unicode Transformation Format 8-bit''. Almost every webpage is transmitted as UTF-8. UTF-8 supports all 1,112,0 ...

system for encoding

Unicode Unicode or ''The Unicode Standard'' or TUS is a character encoding standard maintained by the Unicode Consortium designed to support the use of text in all of the world's writing systems that can be digitized. Version 16.0 defines 154,998 Char ...

characters, which is both a prefix-free code and a

variable-length quantity A variable-length quantity (VLQ) is a universal code that uses an arbitrary number of binary octets (eight- bit bytes) to represent an arbitrarily large integer. A VLQ is essentially a base-128 representation of an unsigned integer with the add ...

Techniques

Commonly used techniques for constructing prefix codes include Huffman codes and the earlier Shannon–Fano codes, and universal codes such as: * Elias delta coding * Elias gamma coding * Elias omega coding *

* Levenshtein coding * Unary coding * Golomb Rice code * Straddling checkerboard (simple cryptography technique which produces prefix codes) * binary coding

Notes

References

* * * D.A. Huffman, "A method for the construction of minimum-redundancy codes", Proceedings of the I.R.E., Sept. 1952, pp. 1098–1102 (Huffman's original article)
Profile: David A. Huffman

Scientific American ''Scientific American'', informally abbreviated ''SciAm'' or sometimes ''SA'', is an American popular science magazine. Many scientists, including Albert Einstein and Nikola Tesla, have contributed articles to it, with more than 150 Nobel Pri ...

, Sept. 1991, pp. 54–58 (Background story) * Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, and

Clifford Stein Clifford Seth Stein (born December 14, 1965), a computer scientist, is a professor of industrial engineering and operations research at Columbia University in New York, NY, where he also holds an appointment in the Department of Computer Scien ...

. ''

Introduction to Algorithms ''Introduction to Algorithms'' is a book on computer programming by Thomas H. Cormen, Charles E. Leiserson, Ron Rivest, Ronald L. Rivest, and Clifford Stein. The book is described by its publisher as "the leading algorithms text in universities w ...

'', Second Edition. MIT Press and McGraw-Hill, 2001. . Section 16.3, pp. 385–392. *

External links

Codes, trees and the prefix property
by Kona Macphee {{Compression methods Coding theory

Data compression Lossless compression algorithms