Incremental encoding, also known as front compression, back compression, or front coding, is a type of delta encoding compression algorithm whereby common prefixes or suffixes and their lengths are recorded so that they need not be duplicated. This algorithm is particularly well-suited for compressing sorted data, e.g., a list of words from a dictionary.
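For example, if the word "nabbed" follows "nab" in a sorted list, it can be stored as the prefix length 3 followed by the suffix "bed", since its first three characters repeat the previous entry.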
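The scheme can be sketched in a few lines of Python. This is a minimal illustration, assuming a sorted list of strings and an in-memory representation as (prefix length, suffix) pairs; the function names `front_encode` and `front_decode` are chosen here for illustration and do not come from any particular library.

```python
from os.path import commonprefix  # character-wise common prefix of strings


def front_encode(words):
    """Turn a list of words into (shared_prefix_length, suffix) pairs."""
    encoded = []
    previous = ""
    for word in words:
        # Number of leading characters shared with the previous word.
        shared = len(commonprefix([previous, word]))
        encoded.append((shared, word[shared:]))
        previous = word
    return encoded


def front_decode(pairs):
    """Rebuild the original words from (shared_prefix_length, suffix) pairs."""
    words = []
    previous = ""
    for shared, suffix in pairs:
        # Reuse the shared prefix of the previous word, then append the suffix.
        word = previous[:shared] + suffix
        words.append(word)
        previous = word
    return words


words = ["myxa", "myxophyta", "myxopod", "nab", "nabbed", "nabbing", "nabit"]
pairs = front_encode(words)
# pairs == [(0, "myxa"), (3, "ophyta"), (5, "od"), (0, "nab"),
#           (3, "bed"), (4, "ing"), (3, "it")]
assert front_decode(pairs) == words
```

The savings come from sorting: neighbouring entries tend to share long prefixes, so most suffixes are much shorter than the words they stand for. Decoding is sequential, because each entry depends on the one before it.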
The encoding used to store the common prefix length itself varies from application to application. Typical techniques are storing the value as a single byte; delta encoding, which stores only the change in the common prefix length; and various universal codes. Incremental encoding may be combined with other general lossless data compression techniques, such as entropy encoding and dictionary coders, to compress the remaining suffixes.
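As a rough sketch of the single-byte option, the following Python code serializes each entry as one byte for the shared prefix length, one byte for the suffix length, and then the suffix bytes. The byte layout and the names `pack_front_coded` and `unpack_front_coded` are assumptions made for this illustration; they are not the on-disk format of any real tool. A delta or universal code could replace the one-byte prefix field without changing the overall structure.

```python
def pack_front_coded(words):
    """Serialize words as [prefix_len byte][suffix_len byte][suffix bytes]...

    Hypothetical layout for illustration only; lengths are counted in
    UTF-8 bytes and each must fit in a single byte (0..255).
    """
    out = bytearray()
    previous = b""
    for word in words:
        current = word.encode("utf-8")
        # Count leading bytes shared with the previous entry, capped at 255.
        shared = 0
        limit = min(len(previous), len(current), 255)
        while shared < limit and previous[shared] == current[shared]:
            shared += 1
        suffix = current[shared:]
        if len(suffix) > 255:
            raise ValueError("suffix too long for a one-byte length field")
        out.append(shared)
        out.append(len(suffix))
        out += suffix
        previous = current
    return bytes(out)


def unpack_front_coded(data):
    """Inverse of pack_front_coded: rebuild the list of words."""
    words = []
    previous = b""
    pos = 0
    while pos < len(data):
        shared, suffix_len = data[pos], data[pos + 1]
        pos += 2
        current = previous[:shared] + data[pos:pos + suffix_len]
        pos += suffix_len
        words.append(current.decode("utf-8"))
        previous = current
    return words


sample = ["nab", "nabbed", "nabbing", "nabit", "nabk", "nabob"]
packed = pack_front_coded(sample)
assert unpack_front_coded(packed) == sample
```

Prefixes are compared on encoded bytes rather than characters, so a shared prefix may end in the middle of a multi-byte character; this is harmless here because decoding reassembles the full byte string before converting it back to text.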
Applications
Incremental encoding is widely used in information retrieval to compress the lexicons used in search indexes; these list all the words found in all the documents, with a pointer from each word to a list of its locations. Typically, it compresses these indexes by about 40%.
[Ian H. Witten, Alistair Moffat, Timothy C. Bell. Managing Gigabytes. Second edition. Academic Press. Section 4.1: Accessing the lexicon, subsection Front coding, pp. 159–161.]
As one example, incremental encoding is used as a starting point by the GNU locate utility, in an index of filenames and directories. GNU locate then applies bigram encoding to further shorten popular filepath prefixes.