Theoretical efficiency
In the second of the two papers that introduced these algorithms, they are analyzed as encoders defined by finite-state machines. A measure analogous to information entropy is developed for individual sequences (as opposed to probabilistic ensembles). This measure gives a bound on the data compression ratio that can be achieved. It is then shown that there exist finite lossless encoders for every sequence that achieve this bound as the length of the sequence grows to infinity. In this sense, an algorithm based on this scheme produces asymptotically optimal encodings. This result can be proven more directly, as for example in notes by Peter Shor.

LZ77
LZ77 algorithms achieve compression by replacing repeated occurrences of data with references to a single copy of that data existing earlier in the uncompressed data stream. A match is encoded by a pair of numbers called a ''length–distance pair'', which is equivalent to the statement "each of the next ''length'' characters is equal to the character exactly ''distance'' characters behind it in the uncompressed stream". (The ''distance'' is sometimes called the ''offset'' instead.)

To spot matches, the encoder must keep track of some amount of the most recent data, such as the last 2 KB, 4 KB, or 32 KB. The structure in which this data is held is called a ''sliding window'', which is why LZ77 is sometimes called ''sliding-window compression''. The encoder needs to keep this data to look for matches, and the decoder needs to keep this data to interpret the matches the encoder refers to. The larger the sliding window is, the further back the encoder may search for matches when creating references.

It is not only acceptable but frequently useful to allow length–distance pairs to specify a length that actually exceeds the distance. As a copy command, this is puzzling: "Go back ''four'' characters and copy ''ten'' characters from that position into the current position". How can ten characters be copied over when only four of them are actually in the buffer? Tackling one byte at a time, there is no problem serving this request, because as a byte is copied over, it may be fed again as input to the copy command. When the copy-from position reaches the initial destination position, it is consequently fed data that was pasted from the ''beginning'' of the copy-from position. The operation is thus equivalent to the statement "copy the data you were given and repetitively paste it until it fits". As this type of pair repeats a single copy of data multiple times, it provides a flexible and easy form of run-length encoding.

Pseudocode
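The byte-at-a-time overlapping copy described above can be sketched in Python. This is an illustrative helper only, not part of any particular LZ77 format; the function name is an assumption of this sketch.

```python
def copy_match(buffer: bytearray, distance: int, length: int) -> None:
    """Copy `length` bytes starting `distance` bytes back, one byte at a
    time, so the source region may overlap the destination (length > distance)."""
    for _ in range(length):
        # each appended byte is immediately available as a source for the next copy
        buffer.append(buffer[-distance])

buf = bytearray(b"abcd")
copy_match(buf, distance=4, length=10)   # "go back 4, copy 10"
# the four-byte pattern repeats itself: buf == b"abcdabcdabcdab"
```

Because each copied byte immediately becomes available as a source for the next one, the window's tail pattern repeats, which is what makes this trick a simple form of run-length encoding.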
The pseudocode is a reproduction of the LZ77 compression algorithm sliding window:

 while input is not empty do
     match := longest repeated occurrence of input that begins in window

     if match exists then
         d := distance to start of match
         l := length of match
         c := char following match in input
     else
         d := 0
         l := 0
         c := first char of input
     end if

     output (d, l, c)

     discard l + 1 chars from front of window
     s := pop l + 1 chars from front of input
     append s to back of window
 repeat

Implementations
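As one illustrative, non-normative sketch, the pseudocode above translates to Python roughly as follows. The window size and the (distance, length, literal) triple format are assumptions of this sketch, not any particular production codec.

```python
def lz77_compress(data: bytes, window_size: int = 4096):
    """Emit (distance, length, literal) triples following the pseudocode.

    distance == length == 0 encodes a bare literal.  A match is allowed to
    run past the current position (length may exceed distance)."""
    out = []
    pos = 0
    while pos < len(data):
        best_d, best_l = 0, 0
        # search the sliding window for the longest match starting before pos
        for start in range(max(0, pos - window_size), pos):
            l = 0
            # stop one byte early so a literal always follows the match
            while pos + l < len(data) - 1 and data[start + l] == data[pos + l]:
                l += 1
            if l > best_l:
                best_d, best_l = pos - start, l
        out.append((best_d, best_l, data[pos + best_l]))
        pos += best_l + 1          # consume the match plus the literal
    return out
```

Decompression replays each triple by copying ''length'' bytes from ''distance'' bytes back (one byte at a time, so overlaps are safe) and then appending the literal.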
Even though all LZ77 algorithms work by definition on the same basic principle, they can vary widely in how they encode their compressed data to vary the numerical ranges of a length–distance pair, alter the number of bits consumed for a length–distance pair, and distinguish their length–distance pairs from ''literals'' (raw data encoded as itself, rather than as part of a length–distance pair). A few examples:

* The algorithm illustrated in Lempel and Ziv's original 1977 article outputs all its data three values at a time: the length and distance of the longest match found in the buffer, and the literal that followed that match. If two successive characters in the input stream could be encoded only as literals, the length of the length–distance pair would be 0.
* LZSS improves on LZ77 by using a 1-bit flag to indicate whether the next chunk of data is a literal or a length–distance pair, and using literals if a length–distance pair would be longer.
* In the PalmDoc format, a length–distance pair is always encoded by a two-byte sequence. Of the 16 bits that make up these two bytes, 11 bits go to encoding the distance, 3 go to encoding the length, and the remaining two are used to make sure the decoder can identify the first byte as the beginning of such a two-byte sequence.
* In the implementation used for many games by Electronic Arts, the size in bytes of a length–distance pair can be specified inside the first byte of the length–distance pair itself.

LZ78
The LZ78 algorithms compress sequential data by building a dictionary of token sequences from the input, and then replacing the second and subsequent occurrences of a sequence in the data stream with a reference to the dictionary entry. The observation is that the number of repeated sequences is a good measure of the non-random nature of a sequence. The algorithms represent the dictionary as an n-ary tree where n is the number of tokens used to form token sequences. Each dictionary entry is of the form {index, token}, where index is the index to a dictionary entry representing a previously seen sequence, and token is the next token from the input that makes this entry unique in the dictionary. Note how the algorithm is greedy, and so nothing is added to the table until a token that makes the entry unique is found.

The algorithm is to initialize last matching index = 0 and next available index = 1 and then, for each token of the input stream, search the dictionary for a match: {last matching index, token}. If a match is found, then last matching index is set to the index of the matching entry, nothing is output, and last matching index is left representing the input so far. Input is processed until a match is ''not'' found. Then a new dictionary entry is created, {last matching index, token}, and the algorithm outputs last matching index, followed by token, then resets last matching index = 0 and increments next available index.

As an example, consider the sequence of tokens AABBA, which would assemble the dictionary (entry 0 being the empty sequence)

 1 {0, A}
 2 {1, B}
 3 {0, B}

and the output sequence of the compressed data would be 0A1B0B. Note that the last A is not represented yet, as the algorithm cannot know what comes next. In practice an EOF marker is added to the input, AABBA$ for example.
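The scheme just described can be sketched in Python. This is a minimal illustration, not a standard implementation: the pair representation is kept as plain tuples and the EOF marker is treated as an ordinary token; the decompression side, described next, is included to show the round trip.

```python
def lz78_compress(tokens):
    """LZ78 sketch: each known sequence, keyed by (index, token), maps to its
    dictionary index; entry 0 is the empty sequence.  Output is a list of
    (last matching index, token) pairs."""
    dictionary = {}          # (last matching index, token) -> entry index
    next_index = 1
    last = 0                 # index of the longest match seen so far
    out = []
    for token in tokens:
        if (last, token) in dictionary:
            last = dictionary[(last, token)]     # keep extending the match
        else:
            out.append((last, token))            # emit pair, add new entry
            dictionary[(last, token)] = next_index
            next_index += 1
            last = 0                             # start matching afresh
    return out

def lz78_decompress(pairs):
    """Rebuild the dictionary on the fly; entry 0 is the empty sequence."""
    entries = [""]
    out = []
    for index, token in pairs:
        seq = entries[index] + token    # previously seen sequence + new token
        entries.append(seq)
        out.append(seq)
    return "".join(out)
```

For the article's example, lz78_compress("AABBA$") yields [(0, 'A'), (1, 'B'), (0, 'B'), (1, '$')], i.e. 0A1B0B1$, and decompressing those pairs returns "AABBA$".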
Note also that in this case the output is longer than the original input, but the compression ratio improves considerably as the dictionary grows, and in binary the indexes need not be represented by any more than the minimum number of bits.https://math.mit.edu/~goemans/18310S15/lempel-ziv-notes.pdf

Decompression consists of rebuilding the dictionary from the compressed sequence. From the sequence 0A1B0B1$, the first entry is always the terminator, entry 0 (the empty sequence), and the first entry built from the sequence would be 1 {0, A}. The A is added to the output. The second pair from the input is 1B and results in entry number 2 in the dictionary, {1, B}. The token "B" is output, preceded by the sequence represented by dictionary entry 1. Entry 1 is an 'A' (followed by "entry 0" - nothing), so AB is added to the output. Next, 0B is added to the dictionary as the next entry, 3 {0, B}, and B (preceded by nothing) is added to the output. Finally, a dictionary entry for 1$ is created and A$ is output, resulting in A AB B A$, or AABBA$ (AABBA after removing the spaces and EOF marker).

LZW
LZW is an LZ78-based algorithm that uses a dictionary pre-initialized with all possible characters (symbols), or an emulation of a pre-initialized dictionary. The main improvement of LZW is that when a match is not found, the current input stream character is assumed to be the first character of an existing string in the dictionary (since the dictionary is initialized with all possible characters), so only the ''last matching index'' is output (which may be the pre-initialized dictionary index corresponding to the previous (or the initial) input character). Refer to the LZW article for implementation details.

BTLZ is an LZ78-based algorithm that was developed for use in real-time communications systems (originally modems) and standardized by CCITT/ITU as V.42bis. When the trie-structured dictionary is full, a simple re-use/recovery algorithm is used to ensure that the dictionary can keep adapting to changing data.

See also
* Lempel–Ziv–Stac (LZS)