Incremental Encoding

Incremental encoding, also known as front compression, back compression, or front coding, is a type of delta-encoding compression algorithm whereby common prefixes or suffixes and their lengths are recorded so that they need not be duplicated. This algorithm is particularly well suited to compressing sorted data, e.g., a list of words from a dictionary.

The encoding used to store the common prefix length itself varies from application to application. Typical techniques are storing the value as a single byte; delta encoding, which stores only the change in the common prefix length; and various universal codes. Incremental encoding may be combined with other general lossless data compression techniques, such as entropy encoding and dictionary coders, to compress the remaining suffixes.


Applications

Incremental encoding is widely used in information retrieval to compress the lexicons used in search indexes; a lexicon lists every word found in the indexed documents, with a pointer from each word to a list of its locations. Typically, it compresses these indexes by about 40% (Ian H. Witten, Alistair Moffat, Timothy C. Bell. Managing Gigabytes. Second edition. Academic Press. Section 4.1: Accessing the lexicon, subsection Front coding, pp. 159–161). As one example, the GNU locate utility uses incremental encoding as a starting point for its index of filenames and directories, and additionally uses bigram encoding to shorten popular filepath prefixes.
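
As a rough illustration of how incremental encoding applies to a sorted list of file paths, the following Python sketch stores only the change in the shared prefix length from one entry to the next, together with the remaining suffix. It is a simplified sketch under stated assumptions: the function names (`encode_paths`, `decode_paths`) and the sample paths are hypothetical, it does not reproduce the actual on-disk format of any locate database, and it omits the bigram step entirely.

```python
def encode_paths(sorted_paths):
    """Encode each path as (change in shared prefix length, remaining suffix)."""
    entries = []
    previous_path = ""
    previous_shared = 0
    for path in sorted_paths:
        # Count how many leading characters this path shares with the previous one.
        shared = 0
        limit = min(len(previous_path), len(path))
        while shared < limit and previous_path[shared] == path[shared]:
            shared += 1
        # Store the difference from the previous shared length, not the length itself.
        entries.append((shared - previous_shared, path[shared:]))
        previous_path, previous_shared = path, shared
    return entries


def decode_paths(entries):
    """Rebuild the paths by accumulating the deltas into a running prefix length."""
    paths = []
    previous_path = ""
    shared = 0
    for delta, suffix in entries:
        shared += delta
        path = previous_path[:shared] + suffix
        paths.append(path)
        previous_path = path
    return paths


# Hypothetical sorted path list, for illustration only.
paths = ["/usr/bin/awk", "/usr/bin/sed", "/usr/lib/libc.so", "/usr/lib/libm.so"]
entries = encode_paths(paths)
# entries == [(0, "/usr/bin/awk"), (9, "sed"), (-4, "lib/libc.so"), (7, "m.so")]
```

Because neighbouring paths in a sorted listing tend to share prefixes of similar length, the deltas are usually small numbers, which makes them cheap to store in a single byte or a universal code.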

