Dictionary Encoding
   HOME

TheInfoList



OR:

A dictionary coder, also sometimes known as a substitution coder, is a class of
lossless data compression Lossless compression is a class of data compression that allows the original data to be perfectly reconstructed from the compressed data with no loss of information. Lossless compression is possible because most real-world data exhibits Redundanc ...
algorithms which operate by searching for matches between the text to be compressed and a set of
string String or strings may refer to: *String (structure), a long flexible structure made from threads twisted together, which is used to tie, bind, or hang other objects Arts, entertainment, and media Films * ''Strings'' (1991 film), a Canadian anim ...
s contained in a
data structure In computer science, a data structure is a data organization and storage format that is usually chosen for Efficiency, efficient Data access, access to data. More precisely, a data structure is a collection of data values, the relationships amo ...
(called the 'dictionary') maintained by the encoder. When the encoder finds such a match, it substitutes a reference to the string's position in the data structure.


Methods and applications

Some dictionary coders use a 'static dictionary', one whose full set of strings is determined before coding begins and does not change during the coding process. This approach is most often used when the message or set of messages to be encoded is fixed and large; for instance, an application that stores the contents of a book in the limited storage space of a PDA generally builds a static dictionary from a concordance of the text and then uses that dictionary to compress the verses. This scheme of using
Huffman coding In computer science and information theory, a Huffman code is a particular type of optimal prefix code that is commonly used for lossless data compression. The process of finding or using such a code is Huffman coding, an algorithm developed by ...
to represent indices into a concordance has been called "Huffword". In a related and more general method, a dictionary is built from redundancy extracted from a data environment (various input streams) which dictionary is then used statically to compress a further input stream. For example, a dictionary is built from old English texts then is used to compress a book. More common are methods where the dictionary starts in some predetermined state but the contents change during the encoding process, based on the data that has already been encoded. Both the
LZ77 LZ77 and LZ78 are the two lossless data compression algorithms published in papers by Abraham Lempel and Jacob Ziv in 1977 and 1978. They are also known as Lempel-Ziv 1 (LZ1) and Lempel-Ziv 2 (LZ2) respectively. These two algorithms form the basis ...
and LZ78 algorithms work on this principle. In LZ77, a
circular buffer In computer science, a circular buffer, circular queue, cyclic buffer or ring buffer is a data structure that uses a single, fixed-size buffer as if it were connected end-to-end. This structure lends itself easily to buffering data streams. The ...
called the "sliding window" holds the last ''N'' bytes of data processed. This window serves as the dictionary, effectively storing ''every'' substring that has appeared in the past ''N'' bytes as dictionary entries. Instead of a single index identifying a dictionary entry, two values are needed: the ''length'', indicating the length of the matched text, and the ''offset'' (also called the ''distance''), indicating that the match is found in the sliding window starting ''offset'' bytes before the current text. LZ78 uses a more explicit dictionary structure; at the beginning of the encoding process, the dictionary is empty. An index value of zero is used to represent the end of a string, so the first index of the dictionary is one. At each step of the encoding process, if there is no match, then the last matching index (or zero) and character are both added to the dictionary and output to the compressed stream. If there is a match, then the working index is updated to the matching index, and nothing is output. LZW is similar to LZ78, but, the dictionary is initialized to all possible symbols. The typical implementation works with 8 bit symbols, so the dictionary "codes" for hex 00 to hex FF (decimal 255) are pre-defined. Dictionary entries would be added starting with code value hex 100. Unlike LZ78, if a match is not found (or if the end of data), then only the dictionary code is output. This creates a potential issue since the decoder output is one step behind the dictionary. Refer to LZW for how this is handled. Enhancements to LZW include handing symbol sizes other than 8 bits and having reserved codes to reset the dictionary and to indicate end of data.
Brotli Brotli is a lossless data compression algorithm developed by Jyrki Alakuijala and Zoltán Szabadka. It uses a combination of the general-purpose LZ77 lossless compression algorithm, Huffman coding and 2nd-order context modelling. Brotli is pr ...
is an example of a commonly-used coder that is initialised with a pre-defined dictionary, but later goes on to use more sophisticated content modelling. The Brotli dictionary consists largely of natural-language words and HTML and JavaScript fragments, based on an analysis of web traffic.


References


See also

*
Grammar-based code Grammar-based codes or grammar-based compression are Data compression, compression algorithms based on the idea of constructing a context-free grammar (CFG) for the string to be compressed. Examples include universal lossless data compression algo ...
*
Entropy encoding In information theory, an entropy coding (or entropy encoding) is any lossless data compression method that attempts to approach the lower bound declared by Shannon's source coding theorem, which states that any lossless data compression method ...
{{Compression Methods Lossless compression algorithms Data compression