Snappy (previously known as Zippy) is a fast
data compression
In information theory, data compression, source coding, or bit-rate reduction is the process of encoding information using fewer bits than the original representation. Any particular compression is either lossy or lossless. Lossless compressi ...
and
decompression library written in
C++ by Google based on ideas from
LZ77 and open-sourced in 2011. It does not aim for maximum compression, or compatibility with any other compression library; instead, it aims for very high speeds and reasonable compression. Compression speed is 250
MB/s and decompression speed is 500 MB/s using a single core of a circa 2011
"Westmere" 2.26 GHz Core i7 processor running in
64-bit mode. The
compression ratio
The compression ratio is the ratio between the volume of the cylinder and combustion chamber in an internal combustion engine at their maximum and minimum values.
A fundamental specification for such engines, it is measured two ways: the stat ...
is 20–100% lower than
gzip
gzip is a file format and a software application used for file compression and decompression. The program was created by Jean-loup Gailly and Mark Adler as a free software replacement for the compress program used in early Unix systems, and ...
.
Snappy is widely used in Google projects like
Bigtable,
MapReduce and in compressing data for Google's internal
RPC
RPC may refer to:
Science and technology
* Rational polynomial coefficient
* Reactive Plastic Curtain, a carbon-dioxide-absorbing device used in some rebreather breathing sets
* Regional Playback Control, a regional lockout technology for DVDs
* ...
systems. It can be used in open-source projects like
MariaDB ColumnStore,
Cassandra
Cassandra or Kassandra (; Ancient Greek: Κασσάνδρα, , also , and sometimes referred to as Alexandra) in Greek mythology was a Trojan priestess dedicated to the god Apollo and fated by him to utter true prophecies but never to be believe ...
,
Couchbase,
Hadoop
Apache Hadoop () is a collection of open-source software utilities that facilitates using a network of many computers to solve problems involving massive amounts of data and computation. It provides a software framework for distributed storage ...
,
LevelDB,
MongoDB
MongoDB is a source-available cross-platform document-oriented database program. Classified as a NoSQL database program, MongoDB uses JSON-like documents with optional schemas. MongoDB is developed by MongoDB Inc. and licensed under the Ser ...
,
RocksDB,
Lucene
Apache Lucene is a free and open-source search engine software library, originally written in Java by Doug Cutting. It is supported by the Apache Software Foundation and is released under the Apache Software License. Lucene is widely used as ...
,
Spark, and
InfluxDB
InfluxDB is an open-source time series database (TSDB) developed by the company InfluxData. It is written in the Go programming language for storage and retrieval of time series data in fields such as operations monitoring, application metr ...
. Decompression is tested to detect any errors in the compressed stream. Snappy does not use
inline assembler In computer programming, an inline assembler is a feature of some compilers that allows low-level code written in assembly language to be embedded within a program, among code that otherwise has been compiled from a higher-level language such as ...
(except some optimizations) and is portable.
Stream format
Snappy encoding is not bit-oriented, but byte-oriented (only whole bytes are emitted or consumed from a stream). The format uses no
entropy encoder, like
Huffman tree or
arithmetic encoder.
The first bytes of the stream are the length of uncompressed data, stored as a
little-endian
In computing, endianness, also known as byte sex, is the order or sequence of bytes of a word of digital data in computer memory. Endianness is primarily expressed as big-endian (BE) or little-endian (LE). A big-endian system stores the most ...
varint,
which allows for use of a
variable-length code. The lower seven bits of each byte are used for data and the high bit is a flag to indicate the end of the length field.
The remaining bytes in the stream are encoded using one of four element types. The element type is encoded in the lower two bits of the first byte (''tag byte'') of the element:
* 00 – ''Literal'' – uncompressed data; upper 6 bits are used to store length (len-1) of data. Lengths larger than 60 are stored in a 1-4 byte integer indicated by a 6 bit length of 60 (1 byte) to 63 (4 bytes).
* 01 – Copy with length stored as 3 bits and offset stored as 11 bits; one byte after tag byte is used for part of offset;
* 10 – Copy with length stored as 6 bits of tag byte and offset stored as two-byte integer after the tag byte;
* 11 – Copy with length stored as 6 bits of tag byte and offset stored as four-byte little-endian integer after the tag byte;
The copy refers to the dictionary (just-decompressed data). The offset is the shift from the current position back to the already decompressed stream. The length is the number of bytes to copy from the dictionary. The size of the dictionary was limited by the 1.0 Snappy compressor to 32,768 bytes, and updated to 65,536 in version 1.1.
The complete official description of the snappy format can be found in the google GitHub repository.
Example of a compressed stream
The text
may be compressed to this, shown as hex data with explanations:
0000000: ''ca02'' ''f042'' 5769 6b69 7065 6469 6120 6973 ...BWikipedia is
The first 2 bytes, ''ca02'' are the length, as a little-endian varint (see
Protocol Buffers for the varint specification).
[ Thus the most-significant byte is '02' . 0x02ca(varint) = 0x014a = 330 bytes.
The next two bytes, 0xf042, indicate that a literal of 66+1 bytes follows
0000010: 2061 2066 7265 652c 2077 6562 2d62 6173 a free, web-bas
0000020: 6564 2c20 636f 6c6c 6162 6f72 6174 6976 ed, collaborativ
0000030: 652c 206d 756c 7469 6c69 6e67 7561 6c20 e, multilingual
0000040: 656e 6379 636c 6f''09'' 3f''f0 14''70 726f 6a65 encyclo.?..proje
0x09 is tag-byte of type 01 with length - 4 = 0102 = 210 and offset = 0x03f = 63 or "pedia ";]
0xf014 is a literal with length of 20+1 bytes
0000050: 6374 2e00 0000 0000 0000 0000 0000 0000 ct.
In this example, all common substrings with four or more characters were eliminated by the compression process. More common compressors can compress this better. Unlike compression methods such as gzip and bzip2, there is no entropy encoding used to pack alphabet into the bit stream.
Interfaces
Snappy distributions include C++ and C bindings. Third party-provided bindings and ports include C#, Common Lisp
Common Lisp (CL) is a dialect of the Lisp programming language, published in ANSI standard document ''ANSI INCITS 226-1994 (S20018)'' (formerly ''X3.226-1994 (R1999)''). The Common Lisp HyperSpec, a hyperlinked HTML version, has been derived fr ...
, Crystal (programming language)
Crystal is a general-purpose, object-oriented programming language, designed and developed by Ary Borenszweig, Juan Wajnerman, Brian Cardiff and more than 300 contributors. With syntax inspired by the language Ruby, it is a compiled language w ...
, Erlang, Go, Haskell, Lua
Lua or LUA may refer to:
Science and technology
* Lua (programming language)
* Latvia University of Agriculture
* Last universal ancestor, in evolution
Ethnicity and language
* Lua people, of Laos
* Lawa people, of Thailand sometimes referred t ...
, Java
Java (; id, Jawa, ; jv, ꦗꦮ; su, ) is one of the Greater Sunda Islands in Indonesia. It is bordered by the Indian Ocean to the south and the Java Sea to the north. With a population of 151.6 million people, Java is the world's mo ...
, Nim
Nim is a mathematical two player game.
Nim or NIM may also refer to:
* Nim (programming language)
* Nim Chimpsky, a signing chimpanzee Acronyms
* Network Installation Manager, an IBM framework
* Nuclear Instrumentation Module
* Negative index met ...
, Node.js, Perl
Perl is a family of two High-level programming language, high-level, General-purpose programming language, general-purpose, Interpreter (computing), interpreted, dynamic programming languages. "Perl" refers to Perl 5, but from 2000 to 2019 it ...
, PHP
PHP is a General-purpose programming language, general-purpose scripting language geared toward web development. It was originally created by Danish-Canadian programmer Rasmus Lerdorf in 1993 and released in 1995. The PHP reference implementati ...
, Python, R, Ruby
A ruby is a pinkish red to blood-red colored gemstone, a variety of the mineral corundum (aluminium oxide). Ruby is one of the most popular traditional jewelry gems and is very durable. Other varieties of gem-quality corundum are called sapp ...
, Rust
Rust is an iron oxide, a usually reddish-brown oxide formed by the reaction of iron and oxygen in the catalytic presence of water or air moisture. Rust consists of hydrous iron(III) oxides (Fe2O3·nH2O) and iron(III) oxide-hydroxide (FeO(OH), ...
, Smalltalk
Smalltalk is an object-oriented, dynamically typed reflective programming language. It was designed and created in part for educational use, specifically for constructionist learning, at the Learning Research Group (LRG) of Xerox PARC by ...
, and OpenCL
OpenCL (Open Computing Language) is a framework for writing programs that execute across heterogeneous platforms consisting of central processing units (CPUs), graphics processing units (GPUs), digital signal processors (DSPs), field-prog ...
.
See also
* Zstandard
References
External links
Snappy mailing list
{{Archive formats
Archive formats
Cross-platform free software
Free data compression software
Free software programmed in C++
C++ libraries