In computer science and information theory, a Huffman code is a particular type of optimal prefix code that is commonly used for lossless data compression. The process of finding or using such a code proceeds by means of Huffman coding, an algorithm developed by David A. Huffman while he was a Sc.D. student at MIT, and published in the 1952 paper "A Method for the Construction of Minimum-Redundancy Codes".
The output from Huffman's algorithm can be viewed as a variable-length code table for encoding a source symbol (such as a character in a file). The algorithm derives this table from the estimated probability or frequency of occurrence (''weight'') for each possible value of the source symbol. As in other entropy encoding methods, more common symbols are generally represented using fewer bits than less common symbols. Huffman's method can be efficiently implemented, finding a code in time linear in the number of input weights if these weights are sorted. However, although optimal among methods encoding symbols separately, Huffman coding is not always optimal among all compression methods; it is replaced with arithmetic coding or asymmetric numeral systems [J. Duda, K. Tahboub, N. J. Gadil, E. J. Delp, ''The use of asymmetric numeral systems as an accurate replacement for Huffman coding'', Picture Coding Symposium, 2015] if a better compression ratio is required.
History
In 1951, David A. Huffman and his MIT information theory classmates were given the choice of a term paper or a final exam. The professor, Robert M. Fano, assigned a term paper on the problem of finding the most efficient binary code. Huffman, unable to prove any codes were the most efficient, was about to give up and start studying for the final when he hit upon the idea of using a frequency-sorted binary tree and quickly proved this method the most efficient.

In doing so, Huffman outdid Fano, who had worked with Claude Shannon to develop a similar code. Building the tree from the bottom up guaranteed optimality, unlike the top-down approach of Shannon–Fano coding.
Terminology
Huffman coding uses a specific method for choosing the representation for each symbol, resulting in a prefix code (sometimes called a "prefix-free code"): the bit string representing some particular symbol is never a prefix of the bit string representing any other symbol. Huffman coding is such a widespread method for creating prefix codes that the term "Huffman code" is widely used as a synonym for "prefix code" even when such a code is not produced by Huffman's algorithm.
Problem definition
Informal description
Given: A set of symbols and their weights (usually proportional to probabilities).

Find: A prefix-free binary code (a set of codewords) with minimum expected codeword length (equivalently, a tree with minimum weighted path length from the root).
Formalized description
Input.
Alphabet $A = (a_1, a_2, \dots, a_n)$, which is the symbol alphabet of size $n$.
Tuple $W = (w_1, w_2, \dots, w_n)$, which is the tuple of the (positive) symbol weights (usually proportional to probabilities), i.e. $w_i = \operatorname{weight}(a_i)$ for $i \in \{1, 2, \dots, n\}$.

Output.
Code $C(W) = (c_1, c_2, \dots, c_n)$, which is the tuple of (binary) codewords, where $c_i$ is the codeword for $a_i$, $i \in \{1, 2, \dots, n\}$.

Goal.
Let $L(C(W)) = \sum_{i=1}^{n} w_i \operatorname{length}(c_i)$ be the weighted path length of code $C$. Condition: $L(C(W)) \le L(T(W))$ for any code $T(W)$.
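As an illustration of this definition, the short Python sketch below computes $L(C(W))$ for a hypothetical set of weights and codewords (the values are placeholders chosen for illustration, not data from this article):

```python
# Minimal sketch: weighted path length L(C(W)) = sum_i w_i * length(c_i).
# The weights and codewords are illustrative placeholders.
weights = [0.4, 0.3, 0.2, 0.1]         # W: positive symbol weights
codewords = ["0", "10", "110", "111"]  # C(W): one possible prefix code

L = sum(w * len(c) for w, c in zip(weights, codewords))
print(f"{L:.2f}")  # 1.90 bits per symbol for these placeholder values
```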
Example
We give an example of the result of Huffman coding for a code with five characters and given weights. We will not verify that it minimizes ''L'' over all codes, but we will compute ''L'' and compare it to the Shannon entropy ''H'' of the given set of weights; the result is nearly optimal.
For any code that is ''biunique'', meaning that the code is ''uniquely decodeable'', the sum of the probability budgets across all symbols is always less than or equal to one. In this example, the sum is strictly equal to one; as a result, the code is termed a ''complete'' code. If this is not the case, one can always derive an equivalent code by adding extra symbols (with associated null probabilities), to make the code complete while keeping it ''biunique''.
As defined by Shannon (1948), the information content ''h'' (in bits) of each symbol $a_i$ with non-null probability is

$h(a_i) = \log_2 \frac{1}{w_i}.$
The entropy ''H'' (in bits) is the weighted sum, across all symbols $a_i$ with non-zero probability $w_i$, of the information content of each symbol:

$H(A) = \sum_{w_i > 0} w_i\, h(a_i) = \sum_{w_i > 0} w_i \log_2 \frac{1}{w_i} = -\sum_{w_i > 0} w_i \log_2 w_i.$

(Note: A symbol with zero probability has zero contribution to the entropy, since $\lim_{w \to 0^+} w \log_2 \frac{1}{w} = 0$. So for simplicity, symbols with zero probability can be left out of the formula above.)
As a consequence of Shannon's source coding theorem, the entropy is a measure of the smallest codeword length that is theoretically possible for the given alphabet with associated weights. In this example, the weighted average codeword length is 2.25 bits per symbol, only slightly larger than the calculated entropy of 2.205 bits per symbol. So not only is this code optimal in the sense that no other feasible code performs better, but it is very close to the theoretical limit established by Shannon.
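The comparison can be reproduced with a short computation. The five weights and codeword lengths below are assumptions chosen to be consistent with the figures quoted above (2.25 bits per symbol versus an entropy of about 2.205 bits); a Python sketch:

```python
import math

# Assumed weights and Huffman codeword lengths consistent with the figures
# quoted in this example; they are illustrative, not quoted from a table here.
weights = [0.10, 0.15, 0.30, 0.16, 0.29]
lengths = [3, 3, 2, 2, 2]

L = sum(w * l for w, l in zip(weights, lengths))         # weighted codeword length
H = sum(w * math.log2(1 / w) for w in weights if w > 0)  # Shannon entropy

print(f"L = {L:.3f} bits/symbol")   # 2.250
print(f"H = {H:.3f} bits/symbol")   # about 2.205
```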
In general, a Huffman code need not be unique. Thus the set of Huffman codes for a given probability distribution is a non-empty subset of the codes minimizing $L(C)$ for that probability distribution. (However, for each minimizing codeword length assignment, there exists at least one Huffman code with those lengths.)
Basic technique
Compression

The technique works by creating a binary tree of nodes. These can be stored in a regular array, the size of which depends on the number of symbols, $n$. A node can be either a leaf node or an internal node. Initially, all nodes are leaf nodes, which contain the symbol itself, the weight (frequency of appearance) of the symbol and, optionally, a link to a parent node, which makes it easy to read the code (in reverse) starting from a leaf node. Internal nodes contain a weight, links to two child nodes and an optional link to a parent node. As a common convention, bit '0' represents following the left child and bit '1' represents following the right child. A finished tree has up to $n$ leaf nodes and $n-1$ internal nodes. A Huffman tree that omits unused symbols produces the optimal code lengths.
The process begins with the leaf nodes containing the probabilities of the symbols they represent. Then, the process takes the two nodes with smallest probability, and creates a new internal node having these two nodes as children. The weight of the new node is set to the sum of the weights of the children. We then apply the process again, on the new internal node and on the remaining nodes (i.e., we exclude the two leaf nodes), and we repeat this process until only one node remains; this is the root of the Huffman tree.
The simplest construction algorithm uses a priority queue where the node with lowest probability is given highest priority:
# Create a leaf node for each symbol and add it to the priority queue.
# While there is more than one node in the queue:
## Remove the two nodes of highest priority (lowest probability) from the queue.
## Create a new internal node with these two nodes as children and with probability equal to the sum of the two nodes' probabilities.
## Add the new node to the queue.
# The remaining node is the root node and the tree is complete.
Since efficient priority queue data structures require O(log ''n'') time per insertion, and a tree with ''n'' leaves has 2''n''−1 nodes, this algorithm operates in O(''n'' log ''n'') time, where ''n'' is the number of symbols.
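One way to express this priority-queue construction in Python is sketched below; it uses the standard `heapq` module, represents internal nodes as nested `(left, right)` tuples, and adds an insertion counter purely to keep heap comparisons well defined when weights tie. This is an illustrative implementation, not a canonical one.

```python
import heapq

def build_huffman_tree(freqs):
    """Build a Huffman tree from a {symbol: weight} mapping.

    Leaves are plain symbols; internal nodes are (left, right) tuples.
    """
    # Heap entries are (weight, tie_breaker, node); the counter keeps
    # comparisons well defined when two weights are equal.
    heap = [(w, i, sym) for i, (sym, w) in enumerate(freqs.items())]
    heapq.heapify(heap)
    counter = len(heap)

    while len(heap) > 1:
        w1, _, n1 = heapq.heappop(heap)  # lowest-weight node
        w2, _, n2 = heapq.heappop(heap)  # second-lowest-weight node
        heapq.heappush(heap, (w1 + w2, counter, (n1, n2)))  # merged internal node
        counter += 1

    return heap[0][2]  # root of the Huffman tree

# Example with hypothetical frequencies:
tree = build_huffman_tree({"a": 5, "b": 9, "c": 12, "d": 13, "e": 16, "f": 45})
```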
If the symbols are sorted by probability, there is a linear-time (O(''n'')) method to create a Huffman tree using two queues, the first one containing the initial weights (along with pointers to the associated leaves), with combined weights (along with pointers to the trees) being put in the back of the second queue. This assures that the lowest weight is always kept at the front of one of the two queues (a sketch in code follows the steps below):
#Start with as many leaves as there are symbols.
#Enqueue all leaf nodes into the first queue (by probability in increasing order so that the least likely item is in the head of the queue).
#While there is more than one node in the queues:
##Dequeue the two nodes with the lowest weight by examining the fronts of both queues.
##Create a new internal node, with the two just-removed nodes as children (either node can be either child) and the sum of their weights as the new weight.
##Enqueue the new node into the rear of the second queue.
#The remaining node is the root node; the tree has now been generated.
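Assuming the leaves really are pre-sorted by weight, the two-queue procedure can be sketched in Python with `collections.deque`; this is an illustration of the idea rather than a reference implementation, and it reuses the nested-tuple tree representation from the previous sketch.

```python
from collections import deque

def huffman_tree_from_sorted(items):
    """items: (weight, symbol) pairs sorted by increasing weight.

    Returns the root of a Huffman tree; internal nodes are (left, right) tuples.
    """
    leaves = deque(items)   # first queue: the pre-sorted leaves
    internal = deque()      # second queue: merged subtrees, enqueued at the rear

    def pop_lowest():
        # Take from whichever queue has the smaller weight at its front;
        # ties prefer the first (leaf) queue, which also limits variance.
        if not internal or (leaves and leaves[0][0] <= internal[0][0]):
            return leaves.popleft()
        return internal.popleft()

    while len(leaves) + len(internal) > 1:
        w1, n1 = pop_lowest()
        w2, n2 = pop_lowest()
        internal.append((w1 + w2, (n1, n2)))

    return (internal or leaves)[0][1]

# Example (hypothetical weights, already sorted):
root = huffman_tree_from_sorted([(5, "a"), (9, "b"), (12, "c"), (13, "d"), (16, "e"), (45, "f")])
```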
Once the Huffman tree has been generated, it is traversed to generate a dictionary which maps the symbols to binary codes as follows:
#Start with current node set to the root.
#If node is not a leaf node, label the edge to the left child as ''0'' and the edge to the right child as ''1''. Repeat the process at both the left child and the right child.
The final encoding of any symbol is then read by a concatenation of the labels on the edges along the path from the root node to the symbol.
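Given a tree in the nested-tuple form used in the sketches above (an assumption of these examples, not something required by the algorithm), the labelling and concatenation steps amount to a simple recursive traversal:

```python
def assign_codes(node, prefix="", table=None):
    """Walk the tree, labelling left edges '0' and right edges '1',
    and collect the concatenated labels for each leaf symbol."""
    if table is None:
        table = {}
    if isinstance(node, tuple):              # internal node: (left, right)
        left, right = node
        assign_codes(left, prefix + "0", table)
        assign_codes(right, prefix + "1", table)
    else:                                    # leaf node: the symbol itself
        table[node] = prefix or "0"          # single-symbol alphabet edge case
    return table

# Example: codes = assign_codes(tree)  # 'tree' built by either sketch above
```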
In many cases, time complexity is not very important in the choice of algorithm here, since ''n'' here is the number of symbols in the alphabet, which is typically a very small number (compared to the length of the message to be encoded); whereas complexity analysis concerns the behavior when ''n'' grows to be very large.
It is generally beneficial to minimize the variance of codeword length. For example, a communication buffer receiving Huffman-encoded data may need to be larger to deal with especially long symbols if the tree is highly unbalanced. To minimize variance, simply break ties between queues by choosing the item in the first queue. This modification retains the mathematical optimality of the Huffman coding while both minimizing variance and minimizing the length of the longest character code.
Decompression
Generally speaking, the process of decompression is simply a matter of translating the stream of prefix codes to individual byte values, usually by traversing the Huffman tree node by node as each bit is read from the input stream (reaching a leaf node necessarily terminates the search for that particular byte value). Before this can take place, however, the Huffman tree must be somehow reconstructed. In the simplest case, where character frequencies are fairly predictable, the tree can be preconstructed (and even statistically adjusted on each compression cycle) and thus reused every time, at the expense of at least some measure of compression efficiency. Otherwise, the information to reconstruct the tree must be sent a priori. A naive approach might be to prepend the frequency count of each character to the compression stream. Unfortunately, the overhead in such a case could amount to several kilobytes, so this method has little practical use. If the data is compressed using
canonical encoding, the compression model can be precisely reconstructed with just $B\,2^B$ bits of information (where $B$ is the number of bits per symbol). Another method is to simply prepend the Huffman tree, bit by bit, to the output stream. For example, assuming that the value of 0 represents a parent node and 1 a leaf node, whenever the latter is encountered the tree building routine simply reads the next 8 bits to determine the character value of that particular leaf. The process continues recursively until the last leaf node is reached; at that point, the Huffman tree will thus be faithfully reconstructed. The overhead using such a method ranges from roughly 2 to 320 bytes (assuming an 8-bit alphabet). Many other techniques are possible as well. In any case, since the compressed data can include unused "trailing bits", the decompressor must be able to determine when to stop producing output. This can be accomplished by either transmitting the length of the decompressed data along with the compression model or by defining a special code symbol to signify the end of input (the latter method can adversely affect code length optimality, however).
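As an illustration of the node-by-node decoding described above, the following Python sketch walks a tree in the same nested-tuple form as the earlier examples, restarting at the root each time a leaf is reached; practical decoders usually operate on packed bytes and often use table-driven lookups instead.

```python
def decode(bits, root):
    """Decode an iterable of '0'/'1' characters using a Huffman tree whose
    internal nodes are (left, right) tuples and whose leaves are symbols."""
    symbols = []
    node = root
    for bit in bits:
        node = node[0] if bit == "0" else node[1]  # 0 -> left child, 1 -> right child
        if not isinstance(node, tuple):            # reached a leaf
            symbols.append(node)
            node = root                            # restart at the root
    return symbols

# Example: decode("010011", tree) returns the symbols spelled by that bit string
# under the tree built in the earlier sketches.
```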
Main properties
The probabilities used can be generic ones for the application domain that are based on average experience, or they can be the actual frequencies found in the text being compressed. This requires that a frequency table be stored with the compressed text. See the Decompression section above for more information about the various techniques employed for this purpose.
Optimality
Huffman's original algorithm is optimal for a symbol-by-symbol coding with a known input probability distribution, i.e., separately encoding unrelated symbols in such a data stream. However, it is not optimal when the symbol-by-symbol restriction is dropped, or when the probability mass functions are unknown. Also, if symbols are not independent and identically distributed, a single code may be insufficient for optimality. Other methods such as arithmetic coding often have better compression capability.
Although both aforementioned methods can combine an arbitrary number of symbols for more efficient coding and generally adapt to the actual input statistics, arithmetic coding does so without significantly increasing its computational or algorithmic complexities (though the simplest version is slower and more complex than Huffman coding). Such flexibility is especially useful when input probabilities are not precisely known or vary significantly within the stream. However, Huffman coding is usually faster, and arithmetic coding was historically a subject of some concern over patent issues. Thus many technologies have historically avoided arithmetic coding in favor of Huffman and other prefix coding techniques. As of mid-2010, the most commonly used techniques for this alternative to Huffman coding have passed into the public domain as the early patents have expired.
For a set of symbols with a uniform probability distribution and a number of members which is a power of two, Huffman coding is equivalent to simple binary block encoding, e.g., ASCII coding. This reflects the fact that compression is not possible with such an input, no matter what the compression method, i.e., doing nothing to the data is the optimal thing to do.
Huffman coding is optimal among all methods in any case where each input symbol is a known independent and identically distributed random variable having a probability that is dyadic (i.e., of the form $1/2^k$ for an integer $k$). Prefix codes, and thus Huffman coding in particular, tend to have inefficiency on small alphabets, where probabilities often fall between these optimal (dyadic) points. The worst case for Huffman coding can happen when the probability of the most likely symbol far exceeds $2^{-1} = 0.5$, making the upper limit of inefficiency unbounded.
There are two related approaches for getting around this particular inefficiency while still using Huffman coding. Combining a fixed number of symbols together ("blocking") often increases (and never decreases) compression. As the size of the block approaches infinity, Huffman coding theoretically approaches the entropy limit, i.e., optimal compression. However, blocking arbitrarily large groups of symbols is impractical, as the complexity of a Huffman code is linear in the number of possibilities to be encoded, a number that is exponential in the size of a block. This limits the amount of blocking that is done in practice.
A practical alternative, in widespread use, is run-length encoding. This technique adds one step in advance of entropy coding, specifically counting runs of repeated symbols, which are then encoded. For the simple case of Bernoulli processes, Golomb coding is optimal among prefix codes for coding run length, a fact proved via the techniques of Huffman coding. A similar approach is taken by fax machines using modified Huffman coding. However, run-length coding is not as adaptable to as many input types as other compression technologies.
Variations
Many variations of Huffman coding exist, some of which use a Huffman-like algorithm, and others of which find optimal prefix codes (while, for example, putting different restrictions on the output). Note that, in the latter case, the method need not be Huffman-like, and, indeed, need not even be polynomial time.
''n''-ary Huffman coding
The ''n''-ary Huffman algorithm uses an ''n''-symbol alphabet (e.g., $\{0, 1, \dots, n-1\}$) to encode messages and build an ''n''-ary tree. This approach was considered by Huffman in his original paper. The same algorithm applies as for binary ($n = 2$) codes, except that the ''n'' least probable symbols are taken together, instead of just the 2 least probable. Note that for ''n'' greater than 2, not all sets of source words can properly form an ''n''-ary tree for Huffman coding. In these cases, additional 0-probability placeholders must be added. This is because the tree must form an ''n''-to-1 contractor; for binary coding, this is a 2-to-1 contractor, and any sized set can form such a contractor. If the number of source words is congruent to 1 modulo ''n''−1, then the set of source words will form a proper Huffman tree.
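The number of zero-probability placeholders needed follows directly from this congruence condition; a small Python sketch (an illustration, not part of Huffman's original formulation):

```python
def nary_placeholders(num_symbols, n):
    """Dummy 0-probability symbols to add so that num_symbols plus the padding
    is congruent to 1 modulo n-1, allowing a full n-ary Huffman tree."""
    if n < 2 or num_symbols < 1:
        raise ValueError("need n >= 2 and at least one symbol")
    return (1 - num_symbols) % (n - 1)

# Examples: nary_placeholders(6, 3) == 1, nary_placeholders(5, 4) == 2,
# and nary_placeholders(k, 2) == 0 for any k >= 1 (binary never needs padding).
```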
Adaptive Huffman coding
A variation called adaptive Huffman coding involves calculating the probabilities dynamically based on recent actual frequencies in the sequence of source symbols, and changing the coding tree structure to match the updated probability estimates. It is used rarely in practice, since the cost of updating the tree makes it slower than optimized adaptive arithmetic coding, which is more flexible and has better compression.
Huffman template algorithm
Most often, the weights used in implementations of Huffman coding represent numeric probabilities, but the algorithm given above does not require this; it requires only that the weights form a totally ordered commutative monoid, meaning a way to order weights and to add them. The Huffman template algorithm enables one to use any kind of weights (costs, frequencies, pairs of weights, non-numerical weights) and one of many combining methods (not just addition). Such algorithms can solve other minimization problems, such as minimizing