Lempel–Ziv–Welch

Lempel–Ziv–Welch (LZW) is a universal lossless data compression algorithm created by Abraham Lempel, Jacob Ziv, and Terry Welch. It was published by Welch in 1984 as an improved implementation of the LZ78 algorithm published by Lempel and Ziv in 1978. The algorithm is simple to implement and has the potential for very high throughput in hardware implementations. It is the algorithm of the widely used Unix file compression utility compress and is used in the GIF image format.


Algorithm

The scenario described by Welch's 1984 paper encodes sequences of 8-bit data as fixed-length 12-bit codes. The codes from 0 to 255 represent 1-character sequences consisting of the corresponding 8-bit character, and the codes 256 through 4095 are created in a dictionary for sequences encountered in the data as it is encoded. At each stage in compression, input bytes are gathered into a sequence until the next character would make a sequence with no code yet in the dictionary. The code for the sequence (without that character) is added to the output, and a new code (for the sequence with that character) is added to the dictionary.

The idea was quickly adapted to other situations. In an image based on a color table, for example, the natural character alphabet is the set of color table indexes, and in the 1980s, many images had small color tables (on the order of 16 colors). For such a reduced alphabet, the full 12-bit codes yielded poor compression unless the image was large, so the idea of a variable-width code was introduced: codes typically start one bit wider than the symbols being encoded, and as the codes of each width are used up, the code width increases by 1 bit, up to some prescribed maximum (typically 12 bits). When the maximum code value is reached, encoding proceeds using the existing table, but new codes are not generated for addition to the table.

Further refinements include reserving a code to indicate that the code table should be cleared and restored to its initial state (a "clear code", typically the first value immediately after the values for the individual alphabet characters), and a code to indicate the end of data (a "stop code", typically one greater than the clear code). The clear code lets the table be reinitialized after it fills up, which lets the encoding adapt to changing patterns in the input data. Smart encoders can monitor the compression efficiency and clear the table whenever the existing table no longer matches the input well.

Since codes are added in a manner determined by the data, the decoder mimics building the table as it sees the resulting codes. It is critical that the encoder and decoder agree on the variant of LZW used: the size of the alphabet, the maximum table size (and code width), whether variable-width encoding is used, the initial code size, and whether to use the clear and stop codes (and what values they have). Most formats that employ LZW build this information into the format specification or provide explicit fields for them in a compression header for the data.


Encoding

A high-level view of the encoding algorithm is shown here:

1. Initialize the dictionary to contain all strings of length one.
2. Find the longest string W in the dictionary that matches the current input.
3. Emit the dictionary index for W to output and remove W from the input.
4. Add W followed by the next symbol in the input to the dictionary.
5. Go to Step 2.

A dictionary is initialized to contain the single-character strings corresponding to all the possible input characters (and nothing else except the clear and stop codes, if they are being used). The algorithm works by scanning through the input string for successively longer substrings until it finds one that is not in the dictionary. When such a string is found, the index for the string without the last character (i.e., the longest substring that ''is'' in the dictionary) is retrieved from the dictionary and sent to output, and the new string (including the last character) is added to the dictionary with the next available code. The last input character is then used as the next starting point to scan for substrings.

In this way, successively longer strings are registered in the dictionary and made available for subsequent encoding as single output values. The algorithm works best on data with repeated patterns, so the initial parts of a message see little compression. As the message grows, however, the compression ratio tends asymptotically to the maximum; that is, the compression improves along an increasing curve rather than linearly, approaching its theoretical maximum within a finite length of input rather than only in the infinite limit.
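To make the loop concrete, here is a minimal sketch of the encoder in Python. It is an illustration, not any particular format's implementation: it assumes the alphabet is the 256 single-byte values and omits variable-width bit packing and the clear and stop codes; the name lzw_encode is chosen only for this sketch.

    def lzw_encode(data: str) -> list[int]:
        """Minimal LZW encoder: emit a list of dictionary indices."""
        # Initialize the dictionary with all single-character strings.
        dictionary = {chr(i): i for i in range(256)}
        next_code = 256
        w = ""                       # current match
        output: list[int] = []
        for c in data:
            if w + c in dictionary:
                w += c               # keep extending the current match
            else:
                output.append(dictionary[w])   # emit code for the longest match
                dictionary[w + c] = next_code  # register the new string
                next_code += 1
                w = c                # restart matching at c
        if w:
            output.append(dictionary[w])       # flush the final match
        return output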


Decoding

A high-level view of the decoding algorithm is shown here:

1. Initialize the dictionary to contain all strings of length one.
2. Read the next encoded symbol: is it encoded in the dictionary?
   * Yes: emit the corresponding string W to output; then concatenate the previous string emitted to output with the first symbol of W, and add this to the dictionary.
   * No: concatenate the previous string emitted to output with its first symbol; call this string V, add V to the dictionary, and emit V to output.
3. Repeat Step 2 until the end of the input.

The decoding algorithm works by reading a value from the encoded input and outputting the corresponding string from the dictionary. However, the full dictionary is not needed, only the initial dictionary that contains single-character strings (and which is usually hard-coded in the program, instead of being sent with the encoded data). Instead, the full dictionary is rebuilt during the decoding process in the following way: after decoding a value and outputting a string, the decoder concatenates it with the first character of the ''next'' decoded string (or the first character of the current string, if the next one cannot be decoded, since if the next value is unknown, it must be the value added to the dictionary in ''this'' iteration, and so its first character is the same as the first character of the current string), and updates the dictionary with the new string. The decoder then proceeds to the next input (which was already read in the previous iteration) and processes it as before, and so on until it has exhausted the input stream.
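A matching decoder sketch, under the same assumptions as the lzw_encode sketch above, including the one case where a received code is not yet in the dictionary (discussed in detail in the Decoding example further below):

    def lzw_decode(codes: list[int]) -> str:
        """Minimal LZW decoder matching the lzw_encode sketch above."""
        dictionary = {i: chr(i) for i in range(256)}
        next_code = 256
        w = dictionary[codes[0]]      # the first code is always a single character
        output = [w]
        for code in codes[1:]:
            if code in dictionary:
                entry = dictionary[code]
            elif code == next_code:
                # Code not yet in the table: it must be the entry the encoder
                # just created, so it is w plus w's own first character.
                entry = w + w[0]
            else:
                raise ValueError(f"invalid LZW code: {code}")
            output.append(entry)
            # Mimic the encoder: register w + first character of the new string.
            dictionary[next_code] = w + entry[0]
            next_code += 1
            w = entry
        return "".join(output)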


Variable-width codes

If variable-width codes are being used, the encoder and decoder must be careful to change the width at the same points in the encoded data so they don't disagree on boundaries between individual codes in the stream. In the standard version, the encoder increases the width from ''p'' to ''p'' + 1 when a sequence ω + ''s'' is encountered that is not in the table (so that a code must be added for it) but the next available code in the table is 2^''p'' (the first code requiring ''p'' + 1 bits). The encoder emits the code for ω at width ''p'' (since that code does not require ''p'' + 1 bits), and then increases the code width so that the next code emitted is ''p'' + 1 bits wide.

The decoder is always one code behind the encoder in building the table, so when it sees the code for ω, it generates an entry for code 2^''p'' − 1. Since this is the point where the encoder increases the code width, the decoder must increase the width here as well: at the point where it generates the largest code that fits in ''p'' bits.

Unfortunately, some early implementations of the encoding algorithm increase the code width and ''then'' emit ω at the new width instead of the old width, so that to the decoder it looks like the width changes one code too early. This is called "early change"; it caused so much confusion that Adobe now allows both versions in PDF files, but includes an explicit flag in the header of each LZW-compressed stream to indicate whether early change is being used. Of the graphics file formats that support LZW compression, TIFF uses early change, while GIF and most others don't.

When the table is cleared in response to a clear code, both encoder and decoder change the code width after the clear code back to the initial code width, starting with the code immediately following the clear code.
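The timing rule can be captured in a small hypothetical helper, shown here as an illustration only. It assumes next_code is the next free dictionary slot ''before'' the current step's new entry is added:

    def code_width(next_code: int, early_change: bool = False) -> int:
        """Bit width for the next emitted code.

        next_code is the next free dictionary slot before this step's new
        entry is added. With early change (as in TIFF), the width grows one
        code earlier. Caps such as the 12-bit maximum are omitted here.
        """
        largest = next_code - 1          # largest code that could be emitted
        if early_change:
            largest += 1                 # pretend the new entry already exists
        return max(largest.bit_length(), 1)

For the 27-symbol example further below, code_width(32) is 5 while code_width(33) is 6, matching the switch described there; with early_change=True the width becomes 6 one code sooner.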


Packing order

Since the codes emitted typically do not fall on byte boundaries, the encoder and decoder must agree on how codes are packed into bytes. The two common methods are ''LSB-first'' ("least significant bit first") and ''MSB-first'' ("most significant bit first"). In LSB-first packing, the first code is aligned so that the least significant bit of the code falls in the least significant bit of the first stream byte, and if the code has more than 8 bits, the high-order bits left over are aligned with the least significant bits of the next byte; further codes are packed with their LSB going into the least significant bit not yet used in the current stream byte, proceeding into further bytes as necessary. MSB-first packing aligns the first code so that its ''most'' significant bit falls in the MSB of the first stream byte, with overflow aligned with the MSB of the next byte; further codes are written with their MSB going into the most significant bit not yet used in the current stream byte. GIF files use LSB-first packing order; TIFF and PDF files use MSB-first packing order.


Example

The following example illustrates the LZW algorithm in action, showing the status of the output and the dictionary at every stage, both in encoding and decoding the data. This example has been constructed to give reasonable compression on a very short message. In real text data, repetition is generally less pronounced, so longer input streams are typically necessary before the compression builds up efficiency.

The plaintext to be encoded (from an alphabet using only the capital letters) is:

TOBEORNOTTOBEORTOBEORNOT#

The # is a marker used to show that the end of the message has been reached. There are thus 26 symbols in the plaintext alphabet (the 26 capital letters ''A'' through ''Z''), and the ''#'' character represents a stop code. We arbitrarily assign these the values 1 through 26 for the letters, and 0 for '#'. (Most flavors of LZW would put the stop code ''after'' the data alphabet, but nothing in the basic algorithm requires that. The encoder and decoder only have to agree on what value it has.)

A computer renders these as strings of bits. Five-bit codes are needed to give sufficient combinations to encompass this set of 27 values. The dictionary is initialized with these 27 values. As the dictionary grows, the codes must grow in width to accommodate the additional entries. A 5-bit code gives 2^5 = 32 possible combinations of bits, so when the 33rd dictionary word is created, the algorithm must switch at that point from 5-bit strings to 6-bit strings (for ''all'' code values, including those previously output with only five bits). Note that since the all-zero code 00000 is used, and is labeled "0", the 33rd dictionary entry is labeled 32. (Previously generated output is not affected by the code-width change, but once a 6-bit value is generated in the dictionary, it could conceivably be the next code emitted, so the width for subsequent output shifts to 6 bits to accommodate that.)

The initial dictionary, then, consists of the following entries: code 0 for ''#'', and codes 1 through 26 for the letters ''A'' through ''Z''.


Encoding

Buffer input characters in a sequence ω until ω + next character is not in the dictionary. Emit the code for ω, and add ω + next character to the dictionary. Start buffering again with the next character. (The string to be encoded is "TOBEORNOTTOBEORTOBEORNOT#".)

Following this procedure, the encoder emits the codes 20 (T), 15 (O), 2 (B), 5 (E), 15 (O), 18 (R), 14 (N), 15 (O), 20 (T), 27 (TO), 29 (BE), 31 (OR), 36 (TOB), 30 (EO), 32 (RN), 34 (OT), and 0 (#). The first six codes are emitted at 5 bits; the width rises to 6 when dictionary entry 32 (RN) is created, so the remaining eleven codes are emitted at 6 bits.

Unencoded length = 25 symbols × 5 bits/symbol = 125 bits
Encoded length = (6 codes × 5 bits/code) + (11 codes × 6 bits/code) = 96 bits

Using LZW has saved 29 bits out of 125, reducing the message by more than 23%. If the message were longer, then the dictionary words would begin to represent longer and longer sections of text, sending repeated words very compactly.
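The emitted sequence can be reproduced with a short script, a sketch specialized to this 27-symbol alphabet (variable-width bit packing is not simulated, only the sequence of codes):

    # Alphabet of the example: '#' -> 0, 'A'..'Z' -> 1..26.
    dictionary = {"#": 0, **{chr(ord("A") + i): i + 1 for i in range(26)}}
    next_code, w, codes = 27, "", []
    for c in "TOBEORNOTTOBEORTOBEORNOT#":
        if w + c in dictionary:
            w += c
        else:
            codes.append(dictionary[w])
            dictionary[w + c] = next_code
            next_code += 1
            w = c
    codes.append(dictionary[w])  # the final '#', i.e. the stop code 0
    print(codes)
    # [20, 15, 2, 5, 15, 18, 14, 15, 20, 27, 29, 31, 36, 30, 32, 34, 0]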


Decoding

To decode an LZW-compressed archive, one needs to know in advance the initial dictionary used, but additional entries can be reconstructed, as they are always simply concatenations of previous entries.

At each stage, the decoder receives a code X; it looks X up in the table and outputs the sequence χ it codes, and it conjectures χ + ? as the entry the encoder just added, because the encoder emitted X for χ precisely because χ + ? was not in the table, and the encoder goes ahead and adds it. But what is the missing letter? It is the first letter in the sequence coded by the ''next'' code Z that the decoder receives. So the decoder looks up Z, decodes it into the sequence ω, takes the first letter z, and tacks it onto the end of χ as the next dictionary entry.

This works as long as the codes received are in the decoder's dictionary, so that they can be decoded into sequences. What happens if the decoder receives a code Z that is not yet in its dictionary? Since the decoder is always just one code behind the encoder, Z can be in the encoder's dictionary only if the encoder ''just'' generated it, when emitting the previous code X for χ. Thus Z codes some ω that is χ + ?, and the decoder can determine the unknown character as follows:

1. The decoder sees X and then Z, where X codes the sequence χ and Z codes some unknown sequence ω.
2. The decoder knows that the encoder just added Z as a code for χ + some unknown character ''c'', so ω = χ + ''c''.
3. Since ''c'' is the first character in the input stream after χ, and since ω is the string appearing immediately after χ, ''c'' must be the first character of the sequence ω.
4. Since χ is an initial substring of ω, ''c'' must also be the first character of χ.
5. So even though the Z code is not in the table, the decoder is able to infer the unknown sequence and adds χ + (the first character of χ) to the table as the value of Z.

This situation occurs whenever the encoder encounters input of the form ''cScSc'', where ''c'' is a single character, ''S'' is a string, and ''cS'' is already in the dictionary, but ''cSc'' is not. The encoder emits the code for ''cS'', putting a new code for ''cSc'' into the dictionary. Next it sees ''cSc'' in the input (starting at the second ''c'' of ''cScSc'') and emits the new code it just inserted. The argument above shows that whenever the decoder receives a code not in its dictionary, the situation must look like this.

Although input of form ''cScSc'' might seem unlikely, this pattern is fairly common when the input stream is characterized by significant repetition. In particular, long strings of a single character (which are common in the kinds of images LZW is often used to encode) repeatedly generate patterns of this sort.
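Using the lzw_encode and lzw_decode sketches from earlier, the simplest instance of the pattern is a run of one character (''S'' empty), which forces the decoder through the inferred-entry branch:

    # A run of a single character is the simplest cScSc instance (S empty).
    # Encoding "aaaa" emits code 256 immediately after creating it, so the
    # decoder receives a code not yet in its own dictionary and must infer it.
    codes = lzw_encode("aaaa")
    print(codes)               # [97, 256, 97]
    print(lzw_decode(codes))   # 'aaaa'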


Further coding

The simple scheme described above focuses on the LZW algorithm itself. Many applications apply further encoding to the sequence of output symbols. Some package the coded stream as printable characters using some form of binary-to-text encoding; this increases the encoded length and decreases the compression rate. Conversely, increased compression can often be achieved with an ''adaptive entropy encoder''. Such a coder estimates the probability distribution for the value of the next symbol, based on the observed frequencies of values so far. A standard entropy encoding such as Huffman coding or arithmetic coding then uses shorter codes for values with higher probabilities.


Uses

LZW compression became the first widely used universal data compression method on computers. A large English text file can typically be compressed via LZW to about half its original size.

LZW was used in the public-domain program compress, which became a more or less standard utility in Unix systems around 1986. It has since disappeared from many distributions, both because it infringed the LZW patent and because gzip produced better compression ratios using the LZ77-based DEFLATE algorithm, but as of 2008 at least FreeBSD includes both compress and uncompress as part of the distribution. Several other popular compression utilities also used LZW or closely related methods.

LZW became very widely used when it became part of the GIF image format in 1987. It may also (optionally) be used in TIFF and PDF files. (Although LZW is available in Adobe Acrobat software, Acrobat by default uses DEFLATE for most text and color-table-based image data in PDF files.)


Patents

Various patents have been issued in the United States and other countries for LZW and similar algorithms. LZ78 was covered by U.S. Patent 4,464,650 by Lempel, Ziv, Cohn, and Eastman, assigned to Sperry Corporation, later Unisys Corporation, filed on August 10, 1981. Two US patents were issued for the LZW algorithm: U.S. Patent 4,814,746 by Victor S. Miller and Mark N. Wegman, assigned to IBM, originally filed on June 1, 1983, and U.S. Patent 4,558,302 by Welch, assigned to Sperry Corporation, later Unisys Corporation, filed on June 20, 1983.

In addition to the above patents, Welch's 1983 patent also includes citations to several other patents that influenced it, including two 1980 Japanese patents (JP9343880A and JP17790880A) from NEC's Jun Kanatsu, a 1974 patent from John S. Hoerning, a 1977 patent from Klaus E. Holtz, and a 1981 German patent (DE19813118676) from Karl Eckhart Heinz.

In 1993–94, and again in 1999, Unisys Corporation received widespread condemnation when it attempted to enforce licensing fees for LZW in GIF images. The 1993–1994 Unisys–CompuServe controversy (CompuServe being the creator of the GIF format) prompted a Usenet comp.graphics discussion, ''Thoughts on a GIF-replacement file format'', which in turn fostered an email exchange that eventually culminated in the creation of the patent-unencumbered Portable Network Graphics (PNG) file format in 1995.

Unisys's US patent on the LZW algorithm expired on June 20, 2003, 20 years after it had been filed. Patents that had been filed in the United Kingdom, France, Germany, Italy, Japan, and Canada all expired in 2004, likewise 20 years after they had been filed.


Variants

* LZMW (1985, by V. Miller, M. Wegman) – searches the input for the longest string already in the dictionary (the "current" match) and adds the concatenation of the previous match with the current match to the dictionary. Dictionary entries thus grow more rapidly, but this scheme is much more complicated to implement. Miller and Wegman also suggest deleting low-frequency entries from the dictionary when the dictionary fills up.
* LZAP (1988, by James Storer; see David Salomon, ''Data Compression – The Complete Reference'', 4th ed., p. 212) – a modification of LZMW: instead of adding just the concatenation of the previous match with the current match to the dictionary, add the concatenations of the previous match with each initial substring of the current match ("AP" stands for "all prefixes"). For example, if the previous match is "wiki" and the current match is "pedia", then the LZAP encoder adds 5 new sequences to the dictionary: "wikip", "wikipe", "wikiped", "wikipedi", and "wikipedia", whereas the LZMW encoder adds only the one sequence "wikipedia". This eliminates some of the complexity of LZMW, at the price of adding more dictionary entries. A small sketch of the LZAP rule follows this list.
* LZWL is a syllable-based variant of LZW.
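As noted above, the LZAP dictionary-update rule can be sketched in a few lines; the helper name lzap_new_entries is hypothetical, chosen for this illustration:

    def lzap_new_entries(prev_match: str, cur_match: str) -> list[str]:
        """Dictionary entries LZAP adds in one step: the previous match
        concatenated with every non-empty prefix of the current match."""
        return [prev_match + cur_match[:i] for i in range(1, len(cur_match) + 1)]

    print(lzap_new_entries("wiki", "pedia"))
    # ['wikip', 'wikipe', 'wikiped', 'wikipedi', 'wikipedia']
    # LZMW would add only the last of these, 'wikipedia'.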


See also

* LZ77 and LZ78
* LZMA
* Lempel–Ziv–Storer–Szymanski (LZSS)
* LZJB
* Context tree weighting
* Discrete cosine transform (DCT), a lossy compression algorithm used in JPEG and MPEG coding standards


External links


* Rosetta Code wiki: LZW algorithm in various languages
* US Patent 4,558,302, Terry A. Welch, ''High speed data compression and decompression apparatus and method''
* SharpLZW – C# open source implementation
* MIT OpenCourseWare lecture including the LZW algorithm
* Shrink, Reduce, and Implode: The Legacy Zip Compression Methods (https://www.hanshq.net/zip2.html) – explains LZW and how it was used in PKZIP