Arithmetic coding (AC) is a form of
entropy encoding used in
lossless data compression. Normally, a
string of characters is represented using a fixed number of
bits per character, as in the
ASCII code. When a string is converted to arithmetic encoding, frequently used characters will be stored with fewer bits and not-so-frequently occurring characters will be stored with more bits, resulting in fewer bits used in total. Arithmetic coding differs from other forms of entropy encoding, such as
Huffman coding, in that rather than separating the input into component symbols and replacing each with a code, arithmetic coding encodes the entire message into a single number, an
arbitrary-precision fraction ''q'', where 0.0 ≤ ''q'' < 1.0. It represents the current information as a range, defined by two numbers.
A recent family of entropy coders called
asymmetric numeral systems allows for faster implementations thanks to directly operating on a single natural number representing the current information.
[J. Duda, K. Tahboub, N. J. Gadgil, E. J. Delp, ''The use of asymmetric numeral systems as an accurate replacement for Huffman coding'', Picture Coding Symposium, 2015.]
Implementation details and examples
Equal probabilities
In the simplest case, the probability of each symbol occurring is equal. For example, consider a set of three symbols, A, B, and C, each equally likely to occur. Simple
block encoding
would require 2 bits per symbol, which is wasteful: one of the bit variations is never used. That is to say, symbols A, B and C might be encoded respectively as 00, 01 and 10, with 11 unused.
A more efficient solution is to represent a sequence of these three symbols as a rational number in base 3 where each digit represents a symbol. For example, the sequence "ABBCAB" could become 0.011201<sub>3</sub>, in arithmetic coding as a value in the interval [0, 1). The next step is to encode this
ternary
number using a fixed-point binary number of sufficient precision to recover it, such as 0.0010110010<sub>2</sub> – this is only 10 bits; 2 bits are saved in comparison with naïve block encoding. This is feasible for long sequences because there are efficient, in-place algorithms for converting the base of arbitrarily precise numbers.
To decode the value, knowing the original string had length 6, one can simply convert back to base 3, round to 6 digits, and recover the string.
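This change-of-base view can be sketched in a few lines of Python (a minimal illustration, assuming the three equiprobable symbols A, B, C and a message length known to the decoder; the function names are ours, not part of any standard library):
<syntaxhighlight lang="python">
from fractions import Fraction

SYMBOLS = "ABC"  # three equally likely symbols

def encode(message):
    """Interpret the message as a base-3 fraction in [0, 1)."""
    value = Fraction(0)
    for i, ch in enumerate(message, start=1):
        value += Fraction(SYMBOLS.index(ch), 3 ** i)
    return value

def to_binary(value, bits):
    """Round the fraction to the nearest fixed-point binary number with `bits` bits."""
    return Fraction(round(value * 2 ** bits), 2 ** bits)

def decode(value, length):
    """Round back to `length` base-3 digits and map the digits to symbols."""
    n = round(value * 3 ** length)
    digits = []
    for _ in range(length):
        n, d = divmod(n, 3)
        digits.append(SYMBOLS[d])
    return "".join(reversed(digits))

exact = encode("ABBCAB")           # 0.011201 in base 3 (= 127/729)
approx = to_binary(exact, 10)      # 0.0010110010 in base 2 (= 178/1024)
print(approx, decode(approx, 6))   # 89/512 ABBCAB
</syntaxhighlight>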
Defining a model
In general, arithmetic coders can produce near-optimal output for any given set of symbols and probabilities. (The optimal value is −log<sub>2</sub>''P'' bits for each symbol of probability ''P''; see ''
Source coding theorem''.) Compression algorithms that use arithmetic coding start by determining a
model of the data – basically a prediction of what patterns will be found in the symbols of the message. The more accurate this prediction is, the closer to optimal the output will be.
Example: a simple, static model for describing the output of a particular monitoring instrument over time might be:
*60% chance of symbol NEUTRAL
*20% chance of symbol POSITIVE
*10% chance of symbol NEGATIVE
*10% chance of symbol END-OF-DATA. ''(The presence of this symbol means that the stream will be 'internally terminated', as is fairly common in data compression; when this symbol appears in the data stream, the decoder will know that the entire stream has been decoded.)''
Models can also handle alphabets other than the simple four-symbol set chosen for this example. More sophisticated models are also possible: ''higher-order'' modelling changes its estimation of the current probability of a symbol based on the symbols that precede it (the ''context''), so that in a model for English text, for example, the percentage chance of "u" would be much higher when it follows a "Q" or a "q". Models can even be ''
adaptive'', so that they continually change their prediction of the data based on what the stream actually contains. The decoder must have the same model as the encoder.
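One simple way an implementation might represent such a static model is as a table of cumulative sub-intervals of [0, 1); the sketch below uses the four symbols and probabilities of the example above (this representation is illustrative, not prescribed by any particular coder):
<syntaxhighlight lang="python">
from fractions import Fraction

# The static four-symbol model from the example, as exact probabilities.
STATIC_MODEL = {
    "NEUTRAL":     Fraction(6, 10),
    "POSITIVE":    Fraction(2, 10),
    "NEGATIVE":    Fraction(1, 10),
    "END-OF-DATA": Fraction(1, 10),
}

def intervals(model):
    """Turn per-symbol probabilities into half-open sub-intervals of [0, 1)."""
    low, table = Fraction(0), {}
    for symbol, p in model.items():
        table[symbol] = (low, low + p)
        low += p
    return table

for symbol, (lo, hi) in intervals(STATIC_MODEL).items():
    print(f"{symbol:12s} [{float(lo)}, {float(hi)})")
# NEUTRAL      [0.0, 0.6)
# POSITIVE     [0.6, 0.8)
# NEGATIVE     [0.8, 0.9)
# END-OF-DATA  [0.9, 1.0)
</syntaxhighlight>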
Encoding and decoding: overview
In general, each step of the encoding process, except for the last, is the same; the encoder has basically just three pieces of data to consider:
* The next symbol that needs to be encoded
* The current interval (at the very start of the encoding process, the interval is set to [0, 1), but that will change)
* The probabilities the model assigns to each of the various symbols that are possible at this stage (as mentioned earlier, higher-order or adaptive models mean that these probabilities are not necessarily the same in each step.)
The encoder divides the current interval into sub-intervals, each representing a fraction of the current interval proportional to the probability of that symbol in the current context. Whichever interval corresponds to the actual symbol that is next to be encoded becomes the interval used in the next step.
Example: for the four-symbol model above:
* the interval for NEUTRAL would be [0, 0.6)
* the interval for POSITIVE would be [0.6, 0.8)
* the interval for NEGATIVE would be [0.8, 0.9)
* the interval for END-OF-DATA would be [0.9, 1).
When all symbols have been encoded, the resulting interval unambiguously identifies the sequence of symbols that produced it. Anyone who has the same final interval and model that is being used can reconstruct the symbol sequence that must have entered the encoder to result in that final interval.
It is not necessary to transmit the final interval, however; it is only necessary to transmit ''one fraction'' that lies within that interval. In particular, it is only necessary to transmit enough digits (in whatever base) of the fraction so that all fractions that begin with those digits fall into the final interval; this will guarantee that the resulting code is a prefix code.
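The interval-narrowing loop described in this overview might look as follows in a simplified, infinite-precision sketch (Python Fractions stand in for exact arithmetic; practical coders use the fixed-precision renormalization discussed later, and the model dictionary reuses the four-symbol example):
<syntaxhighlight lang="python">
from fractions import Fraction

def arithmetic_encode(message, model):
    """Narrow [0, 1) once per symbol; `model` maps symbol -> probability."""
    low, width = Fraction(0), Fraction(1)
    for symbol in message:
        cum = Fraction(0)
        for s, p in model.items():      # locate the symbol's sub-interval
            if s == symbol:
                low += width * cum      # shift to the start of that sub-interval
                width *= p              # shrink the interval proportionally
                break
            cum += p
    return low, low + width             # final interval [low, high)

MODEL = {"NEUTRAL": Fraction(6, 10), "POSITIVE": Fraction(2, 10),
         "NEGATIVE": Fraction(1, 10), "END-OF-DATA": Fraction(1, 10)}

low, high = arithmetic_encode(["NEUTRAL", "NEGATIVE", "END-OF-DATA"], MODEL)
print(float(low), float(high))   # 0.534 0.54 -- any fraction in here (e.g. 0.538) will do
</syntaxhighlight>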
Encoding and decoding: example
Consider the process for decoding a message encoded with the given four-symbol model. The message is encoded in the fraction 0.538 (using decimal for clarity, instead of binary; also assuming that there are only as many digits as needed to decode the message.)
The process starts with the same interval used by the encoder: [0,1), and using the same model, dividing it into the same four sub-intervals that the encoder must have. The fraction 0.538 falls into the sub-interval for NEUTRAL, [0, 0.6); this indicates that the first symbol the encoder read must have been NEUTRAL, so this is the first symbol of the message.
Next divide the interval [0, 0.6) into sub-intervals:
* the interval for NEUTRAL would be [0, 0.36), ''60% of [0, 0.6)''.
* the interval for POSITIVE would be [0.36, 0.48), ''20% of [0, 0.6)''.
* the interval for NEGATIVE would be [0.48, 0.54), ''10% of [0, 0.6)''.
* the interval for END-OF-DATA would be [0.54, 0.6), ''10% of [0, 0.6)''.
Since 0.538 is within the interval [0.48, 0.54), the second symbol of the message must have been NEGATIVE.
Again divide our current interval into sub-intervals:
* the interval for NEUTRAL would be [0.48, 0.516).
* the interval for POSITIVE would be [0.516, 0.528).
* the interval for NEGATIVE would be [0.528, 0.534).
* the interval for END-OF-DATA would be [0.534, 0.540).
Now 0.538 falls within the interval of the END-OF-DATA symbol; therefore, this must be the next symbol. Since it is also the internal termination symbol, it means the decoding is complete. If the stream is not internally terminated, there needs to be some other way to indicate where the stream stops. Otherwise, the decoding process could continue forever, mistakenly reading more symbols from the fraction than were in fact encoded into it.
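The same walk-through can be reproduced programmatically; the sketch below (again using exact fractions and the example's four-symbol model, with END-OF-DATA terminating the loop) decodes 0.538 to the three symbols found above:
<syntaxhighlight lang="python">
from fractions import Fraction

MODEL = {"NEUTRAL": Fraction(6, 10), "POSITIVE": Fraction(2, 10),
         "NEGATIVE": Fraction(1, 10), "END-OF-DATA": Fraction(1, 10)}

def arithmetic_decode(value, model):
    """Repeatedly find which sub-interval `value` falls into, then narrow to it."""
    low, width = Fraction(0), Fraction(1)
    out = []
    while True:
        cum = Fraction(0)
        for symbol, p in model.items():
            sub_low = low + width * cum
            sub_high = sub_low + width * p
            if sub_low <= value < sub_high:
                out.append(symbol)
                low, width = sub_low, width * p
                break
            cum += p
        if out[-1] == "END-OF-DATA":    # internally terminated stream
            return out

print(arithmetic_decode(Fraction(538, 1000), MODEL))
# ['NEUTRAL', 'NEGATIVE', 'END-OF-DATA']
</syntaxhighlight>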
Sources of inefficiency
The message 0.538 in the previous example could have been encoded by the equally short fractions 0.534, 0.535, 0.536, 0.537 or 0.539. This suggests that the use of decimal instead of binary introduced some inefficiency. This is correct; the information content of a three-digit decimal is approximately 9.966 bits; the same message could have been encoded in the binary fraction 0.10001010 (equivalent to 0.5390625 decimal) at a cost of only 8 bits. (The final zero must be specified in the binary fraction, or else the message would be ambiguous without external information such as compressed stream size.)
This 8-bit output is larger than the information content, or entropy, of the message, which is
: <math>-\log_2(0.6) - \log_2(0.1) - \log_2(0.1) \approx 7.381 \text{ bits.}</math>
But an integer number of bits must be used in the binary encoding, so an encoder for this message would use at least 8 bits, resulting in a message 8.4% larger than the entropy content. This inefficiency of at most 1 bit results in relatively less overhead as the message size grows.
Moreover, the claimed symbol probabilities were [0.6, 0.2, 0.1, 0.1], but the actual frequencies in this example are [0.33, 0, 0.33, 0.33]. If the intervals are readjusted for these frequencies, the entropy of the message would be 4.755 bits and the same NEUTRAL NEGATIVE END-OF-DATA message could be encoded as intervals [0, 1/3); [1/9, 2/9); [5/27, 6/27); and a binary interval of [0.00101111011, 0.00111000111). This is also an example of how statistical coding methods like arithmetic encoding can produce an output message that is larger than the input message, especially if the probability model is off.
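These figures can be checked with a few lines of arithmetic (the numbers are recomputed here from the stated probabilities, purely as a verification of the values quoted above):
<syntaxhighlight lang="python">
from math import log2

# Information content under the *claimed* model (0.6, 0.2, 0.1, 0.1)
claimed = -log2(0.6) - log2(0.1) - log2(0.1)
print(round(claimed, 3))            # 7.381 bits for NEUTRAL NEGATIVE END-OF-DATA

print(round(3 * log2(10), 3))       # 9.966 bits carried by three decimal digits
print(round(8 / claimed - 1, 3))    # 0.084 -> the 8-bit output is about 8.4% above entropy

# Entropy if the intervals were readjusted to the *observed* frequencies
# (1/3 each, with POSITIVE never occurring)
print(round(3 * log2(3), 3))        # 4.755 bits
</syntaxhighlight>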
Adaptive arithmetic coding
One advantage of arithmetic coding over other similar methods of data compression is the convenience of adaptation. ''Adaptation'' is the changing of the frequency (or probability) tables while processing the data. The decoded data matches the original data as long as the frequency table in decoding is replaced in the same way and in the same step as in encoding. The synchronization is usually based on a combination of symbols occurring during the encoding and decoding process.
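A toy illustration of such synchronized adaptation is a frequency-count model that both sides update after each symbol; the class below is a hypothetical example of the idea, not a prescribed update rule:
<syntaxhighlight lang="python">
from fractions import Fraction

class AdaptiveModel:
    """Toy adaptive model: probabilities come from running symbol counts.
    Encoder and decoder must call `update` with the same symbols in the same
    order, so that their frequency tables stay synchronized."""

    def __init__(self, alphabet):
        self.counts = {s: 1 for s in alphabet}   # start every symbol at count 1

    def probability(self, symbol):
        total = sum(self.counts.values())
        return Fraction(self.counts[symbol], total)

    def update(self, symbol):
        self.counts[symbol] += 1                 # called after coding each symbol

model = AdaptiveModel(["NEUTRAL", "POSITIVE", "NEGATIVE", "END-OF-DATA"])
print(model.probability("NEUTRAL"))   # 1/4 before any data is seen
model.update("NEUTRAL")
print(model.probability("NEUTRAL"))   # 2/5 after one NEUTRAL has been coded
</syntaxhighlight>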
Precision and renormalization
The above explanations of arithmetic coding contain some simplification. In particular, they are written as if the encoder first calculated the fractions representing the endpoints of the interval in full, using infinite precision, and only converted the fraction to its final form at the end of encoding. Rather than try to simulate infinite precision, most arithmetic coders instead operate at a fixed limit of precision which they know the decoder will be able to match, and round the calculated fractions to their nearest equivalents at that precision. An example shows how this would work if the model called for the interval [0, 1) to be divided into thirds, and this was approximated with 8-bit precision. Note that since now the precision is known, so are the binary ranges we will be able to use:
* Symbol A, probability 1/3: interval reduced to eight-bit precision [0, 85/256), binary range 00000000 to 01010100.
* Symbol B, probability 1/3: interval reduced to eight-bit precision [85/256, 171/256), binary range 01010101 to 10101010.
* Symbol C, probability 1/3: interval reduced to eight-bit precision [171/256, 1), binary range 10101011 to 11111111.
A process called ''renormalization'' keeps the finite precision from becoming a limit on the total number of symbols that can be encoded. Whenever the range is reduced to the point where all values in the range share certain beginning digits, those digits are sent to the output. For however many digits of precision the computer ''can'' handle, it is now handling fewer than that, so the existing digits are shifted left, and at the right, new digits are added to expand the range as widely as possible. Note that this result occurs in two of the three cases from our previous example: the ranges for A and C each share a leading binary digit (0 and 1, respectively), while the range for B straddles 1/2 and so shares no leading digit yet.
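A rough sketch of renormalization with 8-bit precision is shown below; it reflects the general technique rather than any particular coder's exact rules, and it deliberately omits the carry/underflow handling a production coder needs when the interval straddles 1/2 (the case of symbol B above):
<syntaxhighlight lang="python">
PRECISION = 8
FULL = 1 << PRECISION        # the interval [0, 1) scaled to the integers [0, 256)
HALF = FULL // 2

def renormalize(low, high, output_bits):
    """While every value in [low, high) shares its leading bit, emit that bit
    and rescale the interval so the full 8 bits of precision are used again."""
    while high <= HALF or low >= HALF:
        if high <= HALF:                 # interval entirely in [0, 0.5): leading bit 0
            output_bits.append(0)
            low, high = 2 * low, 2 * high
        else:                            # interval entirely in [0.5, 1): leading bit 1
            output_bits.append(1)
            low, high = 2 * (low - HALF), 2 * (high - HALF)
    return low, high

# The thirds model above, rounded to 8 bits: A -> [0, 85), B -> [85, 171), C -> [171, 256).
bits = []
print(renormalize(0, 85, bits), bits)     # A: shares leading bit 0 -> (0, 170) [0]
bits = []
print(renormalize(171, 256, bits), bits)  # C: shares leading bit 1 -> (86, 256) [1]
bits = []
print(renormalize(85, 171, bits), bits)   # B: straddles 1/2, nothing emitted yet
</syntaxhighlight>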
Arithmetic coding as a generalized change of radix
Recall that in the case where the symbols had equal probabilities, arithmetic coding could be implemented by a simple change of base, or radix. In general, arithmetic (and range) coding may be interpreted as a ''generalized'' change of radix. For example, we may look at any sequence of symbols:
: <math>\mathrm{DABDDB}</math>
as a number in a certain base presuming that the involved symbols form an ordered set and each symbol in the ordered set denotes a sequential integer A = 0, B = 1, C = 2, D = 3, and so on. This results in the following frequencies and cumulative frequencies:
* A: frequency 1, cumulative frequency 0
* B: frequency 2, cumulative frequency 1
* D: frequency 3, cumulative frequency 3
The ''cumulative frequency'' for an item is the sum of all frequencies preceding the item. In other words, cumulative frequency is a running total of frequencies.
In a positional numeral system the radix, or base, is numerically equal to the number of different symbols used to express the number. For example, in the decimal system the number of symbols is 10, namely 0, 1, 2, 3, 4, 5, 6, 7, 8, and 9. The radix is used to express any finite integer in a presumed multiplier in polynomial form. For example, the number 457 is actually 4×10<sup>2</sup> + 5×10<sup>1</sup> + 7×10<sup>0</sup>, where base 10 is presumed but not shown explicitly.
Initially, we will convert DABDDB into a base-6 numeral, because 6 is the length of the string. The string is first mapped into the digit string 301331, which then maps to an integer by the polynomial:
: <math>6^5 \cdot 3 + 6^4 \cdot 0 + 6^3 \cdot 1 + 6^2 \cdot 3 + 6^1 \cdot 3 + 6^0 \cdot 1 = 23671</math>
The result 23671 has a length of 15 bits, which is not very close to the theoretical limit (the entropy of the message), which is approximately 9 bits.
To encode a message with a length closer to the theoretical limit imposed by information theory we need to slightly generalize the classic formula for changing the radix. We will compute lower and upper bounds ''L'' and ''U'' and choose a number between them. For the computation of ''L'' we multiply each term in the above expression by the product of the frequencies of all previously occurring symbols:
: <math>6^5 \cdot 3 + 3 \cdot (6^4 \cdot 0) + (3 \cdot 1) \cdot (6^3 \cdot 1) + (3 \cdot 1 \cdot 2) \cdot (6^2 \cdot 3) + (3 \cdot 1 \cdot 2 \cdot 3) \cdot (6^1 \cdot 3) + (3 \cdot 1 \cdot 2 \cdot 3 \cdot 3) \cdot (6^0 \cdot 1) = 25002</math>
The difference between this polynomial and the polynomial above is that each term is multiplied by the product of the frequencies of all previously occurring symbols. More generally, ''L'' may be computed as:
: <math>L = \sum_{i=1}^{n} n^{\,n-i} C_i \prod_{k=1}^{i-1} f_k</math>
where <math>C_i</math> are the cumulative frequencies and <math>f_k</math> are the frequencies of occurrences. Indexes denote the position of the symbol in a message. In the special case where all frequencies are 1, this is the change-of-base formula.
The upper bound ''U'' will be ''L'' plus the product of all frequencies; in this case ''U'' = ''L'' + (3 × 1 × 2 × 3 × 3 × 2) = 25002 + 108 = 25110. In general, ''U'' is given by:
: <math>U = L + \prod_{k=1}^{n} f_k</math>
Now we can choose any number from the interval [''L'', ''U'') to represent the message; one convenient choice is the value with the longest possible trail of zeroes, 25100, since it allows us to achieve compression by representing the result as 251×10<sup>2</sup>. The zeroes can also be truncated, giving 251, if the length of the message is stored separately. Longer messages will tend to have longer trails of zeroes.
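The bounds for DABDDB can be reproduced with a short sketch (the helper names are illustrative; the frequency and cumulative-frequency tables are the ones given above):
<syntaxhighlight lang="python">
from math import prod

message = "DABDDB"
freq = {"A": 1, "B": 2, "D": 3}   # occurrence counts in DABDDB
cum  = {"A": 0, "B": 1, "D": 3}   # cumulative frequency (sum of preceding frequencies)
n = len(message)

L = 0
for i, symbol in enumerate(message):
    # each term is weighted by the frequencies of all previously occurring symbols
    L += cum[symbol] * n ** (n - 1 - i) * prod(freq[s] for s in message[:i])

U = L + prod(freq[s] for s in message)

print(L, U)   # 25002 25110 -- and 25100 lies in [L, U) with the longest trail of zeroes
</syntaxhighlight>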
To decode the integer 25100, the polynomial computation can be reversed as shown in the steps below. At each stage the current symbol is identified, then the corresponding term is subtracted from the result.
* 25100: floor(25100 / 6<sup>5</sup>) = 3 → D; corrected remainder (25100 − 6<sup>5</sup> × 3) / 3 ≈ 590.67
* 590.67: floor(590.67 / 6<sup>4</sup>) = 0 → A; corrected remainder (590.67 − 6<sup>4</sup> × 0) / 1 ≈ 590.67
* 590.67: floor(590.67 / 6<sup>3</sup>) = 2 → B; corrected remainder (590.67 − 6<sup>3</sup> × 1) / 2 ≈ 187.33
* 187.33: floor(187.33 / 6<sup>2</sup>) = 5 → D; corrected remainder (187.33 − 6<sup>2</sup> × 3) / 3 ≈ 26.44
* 26.44: floor(26.44 / 6<sup>1</sup>) = 4 → D; corrected remainder (26.44 − 6<sup>1</sup> × 3) / 3 ≈ 2.81
* 2.81: floor(2.81 / 6<sup>0</sup>) = 2 → B; decoding complete.
During decoding we take the floor after dividing by the corresponding power of 6. The result is then matched against the cumulative intervals and the appropriate symbol is selected from a lookup table. When the symbol is identified the result is corrected. The process is continued for the known length of the message or while the remaining result is positive. The only difference compared with the classical change of base is that there may be a range of values associated with each symbol. In this example, A is always 0, B is either 1 or 2, and D is any of 3, 4, 5. This is in exact accordance with our intervals, which are determined by the frequencies. When all intervals are equal to 1 we have a special case of the classic base change.
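The same step-by-step reversal can be sketched as follows (exact fractions keep the corrected remainders precise; the names are again illustrative):
<syntaxhighlight lang="python">
from fractions import Fraction

freq = {"A": 1, "B": 2, "D": 3}
cum  = {"A": 0, "B": 1, "D": 3}
n = 6                                    # known message length

def symbol_for(digit):
    """Match a base-6 digit against the cumulative intervals [cum, cum + freq)."""
    for s in "ABD":
        if cum[s] <= digit < cum[s] + freq[s]:
            return s

remainder = Fraction(25100)
decoded = []
for i in range(n):
    power = Fraction(6) ** (n - 1 - i)
    digit = remainder // power           # floor after dividing by the power of 6
    s = symbol_for(digit)
    decoded.append(s)
    remainder = (remainder - cum[s] * power) / freq[s]   # correct the remainder

print("".join(decoded))                  # DABDDB
</syntaxhighlight>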
Theoretical limit of compressed message
The lower bound ''L'' never exceeds ''n''<sup>''n''</sup>, where ''n'' is the size of the message, and so can be represented in <math>\log_2(n^n) = n\log_2(n)</math> bits. After the computation of the upper bound ''U'' and the reduction of the message by selecting a number from the interval [''L'', ''U'') with the longest trail of zeros we can presume that this length can be reduced by <math>\textstyle\log_2 \prod_{k=1}^{n} f_k</math> bits. Since each frequency in a product occurs exactly the same number of times as the value of this frequency, we can use the size of the alphabet ''A'' for the computation of the product
: <math>\prod_{k=1}^{n} f_k = \prod_{i=1}^{A} f_i^{f_i}</math>
Applying log2 for the estimated number of bits in the message, the final message (not counting a logarithmic overhead for the message length and frequency tables) will match the number of bits given by entropy, which for long messages is very close to optimal:
: <math>n\log_2(n) - \sum_{i=1}^{A} f_i \log_2(f_i) = -n \sum_{i=1}^{A} \frac{f_i}{n} \log_2\!\left(\frac{f_i}{n}\right) = nH,</math> where ''H'' is the entropy per symbol of the message.
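For the DABDDB example this can be checked numerically (a verification of the formula, using the same frequencies as before):
<syntaxhighlight lang="python">
from math import log2

freq = {"A": 1, "B": 2, "D": 3}          # frequencies in DABDDB
n = sum(freq.values())                   # message length, 6

bits_for_L = n * log2(n)                                 # log2(n^n)
bits_saved = sum(f * log2(f) for f in freq.values())     # log2 of the product of frequencies
entropy_bits = -n * sum((f / n) * log2(f / n) for f in freq.values())

print(round(bits_for_L - bits_saved, 3))   # 8.755
print(round(entropy_bits, 3))              # 8.755 -- the two agree, close to the ~9 bits quoted earlier
</syntaxhighlight>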