Byte pair encoding
   HOME

TheInfoList



OR:

Byte pair encoding or digram coding is a simple form of
data compression In information theory, data compression, source coding, or bit-rate reduction is the process of encoding information using fewer bits than the original representation. Any particular compression is either lossy or lossless. Lossless compressio ...
in which the most common pair of consecutive
byte The byte is a unit of digital information that most commonly consists of eight bits. Historically, the byte was the number of bits used to encode a single character of text in a computer and for this reason it is the smallest addressable uni ...
s of data is replaced with a byte that does not occur within that data. A table of the replacements is required to rebuild the original data. The algorithm was first described publicly by Philip Gage in a February 1994 article "A New Algorithm for Data Compression" in the ''C Users Journal''. A variant of the technique has shown to be useful in several
natural language processing Natural language processing (NLP) is an interdisciplinary subfield of linguistics, computer science, and artificial intelligence concerned with the interactions between computers and human language, in particular how to program computers to proc ...
(NLP) applications, such as
Google Google LLC () is an American Multinational corporation, multinational technology company focusing on Search Engine, search engine technology, online advertising, cloud computing, software, computer software, quantum computing, e-commerce, ar ...
's SentencePiece, and
OpenAI OpenAI is an artificial intelligence (AI) research laboratory consisting of the for-profit corporation OpenAI LP and its parent company, the non-profit OpenAI Inc. The company conducts research in the field of AI with the stated goal of promo ...
's
GPT-3 Generative Pre-trained Transformer 3 (GPT-3) is an autoregressive language model that uses deep learning to produce human-like text. Given an initial text as prompt, it will produce text that continues the prompt. The architecture is a standar ...
. Here, the goal is not data compression, but encoding text in a given language as a sequence of 'tokens', using a fixed vocabulary of different tokens. Typically, most words will be encoded as a single token, while rare words will be encoded as a sequence of a few tokens, where these tokens represent meaningful word parts. This translation of text into tokens can be found by a variant of byte pair encoding.


Byte pair encoding example

Suppose the data to be encoded is aaabdaaabac The byte pair "aa" occurs most often, so it will be replaced by a byte that is not used in the data, "Z". Now there is the following data and replacement table: ZabdZabac Z=aa Then the process is repeated with byte pair "ab", replacing it with Y: ZYdZYac Y=ab Z=aa The only literal byte pair left occurs only once, and the encoding might stop here. Or the process could continue with
recursive Recursion (adjective: ''recursive'') occurs when a thing is defined in terms of itself or of its type. Recursion is used in a variety of disciplines ranging from linguistics to logic. The most common application of recursion is in mathematics ...
byte pair encoding, replacing "ZY" with "X": XdXac X=ZY Y=ab Z=aa This data cannot be compressed further by byte pair encoding because there are no pairs of bytes that occur more than once. To decompress the data, simply perform the replacements in the reverse order.


See also

*
Re-Pair Re-Pair (short for Recursive Pairing) is a grammar-based compression algorithm that, given an input text, builds a straight-line program, i.e. a context-free grammar In formal language theory, a context-free grammar (CFG) is a formal grammar ...
*
Sequitur algorithm Sequitur (or Nevill-Manning algorithm) is a recursive algorithm developed by Craig Nevill-Manning and Ian H. Witten in 1997 that infers a hierarchical structure (context-free grammar) from a sequence of discrete symbols. The algorithm operates in ...


References

{{DEFAULTSORT:Byte Pair Encoding Lossless compression algorithms