Snappy (previously known as Zippy) is a fast

data compression In information theory, data compression, source coding, or bit-rate reduction is the process of encoding information using fewer bits than the original representation. Any particular compression is either lossy or lossless. Lossless compressi ...

and decompression library written in

C++ C++ (, pronounced "C plus plus" and sometimes abbreviated as CPP or CXX) is a high-level, general-purpose programming language created by Danish computer scientist Bjarne Stroustrup. First released in 1985 as an extension of the C programmin ...

by Google based on ideas from

LZ77 LZ77 and LZ78 are the two lossless data compression algorithms published in papers by Abraham Lempel and Jacob Ziv in 1977 and 1978. They are also known as Lempel-Ziv 1 (LZ1) and Lempel-Ziv 2 (LZ2) respectively. These two algorithms form the basis ...

and open-sourced in 2011. It does not aim for maximum compression, or compatibility with any other compression library; instead, it aims for very high speeds and reasonable compression. Compression speed is 250

MB/s In telecommunications, data transfer rate is the average number of bits ( bitrate), characters or symbols ( baudrate), or data blocks per unit time passing through a communication link in a data-transmission system. Common data rate units are mu ...

and decompression speed is 500 MB/s using a single core of a circa 2011 "Westmere" 2.26 GHz Core i7 processor running in 64-bit mode. The

compression ratio The compression ratio is the ratio between the maximum and minimum volume during the compression stage of the power cycle in a piston or Wankel engine. A fundamental specification for such engines, it can be measured in two different ways. Th ...

is 20–100% lower than

gzip gzip is a file format and a software application used for file compression and decompression. The program was created by Jean-loup Gailly and Mark Adler as a free software replacement for the compress program used in early Unix systems, and ...

. Snappy is widely used in Google projects like

Bigtable Bigtable is a fully managed wide-column and key-value NoSQL database service for large analytical and operational workloads as part of the Google Cloud portfolio. History Bigtable development began in 2004.. It is now used by a number of Goo ...

MapReduce MapReduce is a programming model and an associated implementation for processing and generating big data sets with a parallel and distributed algorithm on a cluster. A MapReduce program is composed of a ''map'' procedure, which performs filte ...

and in compressing data for Google's internal

RPC RPC may refer to: Science and technology * Rational polynomial coefficient * Reactive Plastic Curtain, a carbon-dioxide-absorbing device used in some rebreather breathing sets * Regional Playback Control, a regional lockout technology for DVDs ...

systems. It can be used in open-source projects like MariaDB ColumnStore,

Cassandra Cassandra or Kassandra (; , , sometimes referred to as Alexandra; ) in Greek mythology was a Trojan priestess dedicated to the god Apollo and fated by him to utter true prophecy, prophecies but never to be believed. In modern usage her name is e ...

Couchbase Couchbase Server, originally known as Membase, is a source-available, distributed ( shared-nothing architecture) multi-model NoSQL document-oriented database software package optimized for interactive applications. These applications may serv ...

Hadoop Apache Hadoop () is a collection of Open-source software, open-source software utilities for reliable, scalable, distributed computing. It provides a software framework for Clustered file system, distributed storage and processing of big data usin ...

LevelDB LevelDB is an open-source on-disk key-value store written by Google fellows Jeffrey Dean and Sanjay Ghemawat. Inspired by Bigtable, LevelDB source code is hosted on GitHub under the New BSD License and has been ported to a variety of Unix-base ...

MongoDB MongoDB is a source-available, cross-platform, document-oriented database program. Classified as a NoSQL database product, MongoDB uses JSON-like documents with optional database schema, schemas. Released in February 2009 by 10gen (now MongoDB ...

RocksDB RocksDB is a high performance embedded database for key-value data. It is a fork of Google's LevelDB optimized to exploit multi-core processors (CPUs), and make efficient use of fast storage, such as solid-state drives (SSD), for input/outpu ...

Lucene Apache Lucene is a free and open-source search engine software library, originally written in Java by Doug Cutting. It is supported by the Apache Software Foundation and is released under the Apache Software License. Lucene is widely used as a ...

, Spark,

InfluxDB InfluxDB is a time series database (TSDB) developed by the company InfluxData. It is used for storage and retrieval of time series data in fields such as operations monitoring, application metrics, Internet of Things sensor data, and real-time a ...

, and Ceph.

Firefox Mozilla Firefox, or simply Firefox, is a free and open-source web browser developed by the Mozilla Foundation and its subsidiary, the Mozilla Corporation. It uses the Gecko rendering engine to display web pages, which implements curr ...

uses Snappy to compress data in

localStorage Web storage, formerly known as DOM storage (Document Object Model storage), is a standard JavaScript API provided by web browsers. It enables websites to store persistent data on users' devices similar to cookies, but with much larger capacity ...

. Decompression is tested to detect any errors in the compressed stream. Snappy does not use

inline assembler In computer programming, an inline assembler is a feature of some compilers that allows low-level code written in assembly language to be embedded within a program, among code that otherwise has been compiled from a high-level language, higher-leve ...

(except some optimizations) and is portable.

Stream format

Snappy encoding is not bit-oriented, but byte-oriented (only whole bytes are emitted or consumed from a stream). The format uses no entropy encoder, like

Huffman coding In computer science and information theory, a Huffman code is a particular type of optimal prefix code that is commonly used for lossless data compression. The process of finding or using such a code is Huffman coding, an algorithm developed by ...

arithmetic coding Arithmetic coding (AC) is a form of entropy encoding used in lossless data compression. Normally, a String (computer science), string of characters is represented using a fixed number of bits per character, as in the American Standard Code for In ...

. The first bytes of the stream are the length of uncompressed data, stored as a

little-endian '' Jonathan_Swift.html" ;"title="Gulliver's Travels'' by Jonathan Swift">Gulliver's Travels'' by Jonathan Swift, the novel from which the term was coined In computing, endianness is the order in which bytes within a word (data type), word of d ...

varint, which allows for use of a

variable-length code In coding theory, a variable-length code is a code which maps source symbols to a ''variable'' number of bits. The equivalent concept in computer science is '' bit string''. Variable-length codes can allow sources to be compressed and decompr ...

. The lower seven bits of each byte are used for data and the high bit is a flag to indicate the end of the length field. The remaining bytes in the stream are encoded using one of four element types. The element type is encoded in the lower two bits of the first byte (''tag byte'') of the element: * 00 – ''Literal'' – uncompressed data; upper 6 bits are used to store length (len-1) of data. Lengths larger than 60 are stored in a 1-4 byte integer indicated by a 6 bit length of 60 (1 byte) to 63 (4 bytes). * 01 – Copy with length stored as 3 bits and offset stored as 11 bits; one byte after tag byte is used for part of offset; * 10 – Copy with length stored as 6 bits of tag byte and offset stored as two-byte integer after the tag byte; * 11 – Copy with length stored as 6 bits of tag byte and offset stored as four-byte little-endian integer after the tag byte; The copy refers to the dictionary (just-decompressed data). The offset is the shift from the current position back to the already decompressed stream. The length is the number of bytes to copy from the dictionary. The size of the dictionary was limited by the 1.0 Snappy compressor to 32,768 bytes, and updated to 65,536 in version 1.1. The complete official description of the snappy format can be found in the google GitHub repository.

Example of a compressed stream

The text may be compressed to this, shown as hex data with explanations: 000000 51 ''f0 42'' 57 69 6b 69 70 65 64 69 61 20 69 73 20 >Q.BWikipedia is < 000010 61 20 66 72 65 65 2c 20 77 65 62 2d 62 61 73 65 >a free, web-base< 000020 64 2c 20 63 6f 6c 6c 61 62 6f 72 61 74 69 76 65 >d, collaborative< 000030 2c 20 6d 75 6c 74 69 6c 69 6e 67 75 61 6c 20 65 >, multilingual e< The stream starts with the length of the uncompressed data as a varint, so the first byte, with the high bit clear, corresponds to a length of 51₁₆=81 bytes. The first block must be a literal, and ''f042'' corresponds thereto: the first byte is broken down as f0₁₆ ⇒ len−1=111100₂;type=00₂; type 0 signifies a literal, and a length−1 of 111100₂=60 means the length is read from the following byte, in this case 42₁₆=66. The first 66 bytes of the text ("''Wikipedia is a free, web-based, collaborative, multilingual encyclo''") follow. 000040 6e 63 79 63 6c 6f 09 3f ''1c'' 70 72 6f 6a 65 63 74 >ncyclo.?.project< 000050 2e >.< The next block's header consists of 093f, broken down as 09₁₆ ⇒ off_h=000₂,len−4=010₂;type=01₂: type 1 indicates a "copy with 1-byte offset": the length to copy works out to 010₂+4=6 bytes, and the offset is an 11-bit integer whose top bits are off_h and whose low bits are the next byte: 3f, so =00000111111₂=63. This means to copy 6 bytes, starting 63 bytes ago – since 67 bytes have already been copied this evaluates to copying 6 bytes starting at position 4 (from the fifth byte), which produces "''pedia ''". This block has no other content, and thus the following block starts immediately after – ''1c''₁₆ ⇒ len−1=000111₂;type=00₂, i.e. a literal of length 000111₂+1=8. The final part of the text ("''project.''") follows. In this example, all common substrings with four or more characters were eliminated by the compression process. More common compressors can compress this better. Unlike compression methods such as gzip and bzip2, there is no

entropy encoding In information theory, an entropy coding (or entropy encoding) is any lossless data compression method that attempts to approach the lower bound declared by Shannon's source coding theorem, which states that any lossless data compression method ...

used to pack alphabet into the bit stream.

Framing format

The Snappy stream supports inputs with an overall size of up to 4GiB−1, and may add significant overhead to sections which are not or insufficiently compressed, as well as not being self-identifying, and having no data integrity mechanism beyond a simple output size check. To combat these issues, the Snappy framing format "Snappy framed" may be used, which breaks the input into chunks of up to 64KiB, delimited by 4-byte block headers (a one-byte identifier and three-byte length): * the "Stream identifier", with type FF₁₆, must start the stream, and must consist exclusively of "sNaPpY" in

ASCII ASCII ( ), an acronym for American Standard Code for Information Interchange, is a character encoding standard for representing a particular set of 95 (English language focused) printable character, printable and 33 control character, control c ...

, * the "Compressed data", with type 0, contains a compressed Snappy stream, * the "Uncompressed data", with type 1, contains data to copy to the output verbatim. Both types of data chunk also contain a

CRC-32C A cyclic redundancy check (CRC) is an error-detecting code commonly used in digital networks and storage devices to detect accidental changes to digital data. Blocks of data entering these systems get a short ''check value'' attached, based on th ...

checksum of the uncompressed data. Chunks of types 2-7F₁₆ are reserved and must result in errors. Those of types 80₁₆-FE₁₆ may be ignored by the decompressors which do not understand them.

Interfaces

Snappy distributions include C++ and C bindings. Third party-provided bindings and ports include C#,

Common Lisp Common Lisp (CL) is a dialect of the Lisp programming language, published in American National Standards Institute (ANSI) standard document ''ANSI INCITS 226-1994 (S2018)'' (formerly ''X3.226-1994 (R1999)''). The Common Lisp HyperSpec, a hyperli ...

Crystal (programming language) Crystal is a high-level general-purpose, object-oriented programming language, designed and developed by Ary Borenszweig, Juan Wajnerman, Brian Cardiff and more than 400 contributors. With syntax inspired by the language Ruby, it is a compiled ...

, Erlang, Go,

Haskell Haskell () is a general-purpose, statically typed, purely functional programming language with type inference and lazy evaluation. Designed for teaching, research, and industrial applications, Haskell pioneered several programming language ...

Lua Lua is a lightweight, high-level, multi-paradigm programming language designed mainly for embedded use in applications. Lua is cross-platform software, since the interpreter of compiled bytecode is written in ANSI C, and Lua has a relatively ...

Java Java is one of the Greater Sunda Islands in Indonesia. It is bordered by the Indian Ocean to the south and the Java Sea (a part of Pacific Ocean) to the north. With a population of 156.9 million people (including Madura) in mid 2024, proje ...

Nim Nim is a mathematical combinatorial game in which two players take turns removing (or "nimming") objects from distinct heaps or piles. On each turn, a player must remove at least one object, and may remove any number of objects provided they all ...

Node.js Node.js is a cross-platform, open-source JavaScript runtime environment that can run on Windows, Linux, Unix, macOS, and more. Node.js runs on the V8 JavaScript engine, and executes JavaScript code outside a web browser. Node.js lets develope ...

Perl Perl is a high-level, general-purpose, interpreted, dynamic programming language. Though Perl is not officially an acronym, there are various backronyms in use, including "Practical Extraction and Reporting Language". Perl was developed ...

PHP PHP is a general-purpose scripting language geared towards web development. It was originally created by Danish-Canadian programmer Rasmus Lerdorf in 1993 and released in 1995. The PHP reference implementation is now produced by the PHP Group. ...

Python Python may refer to: Snakes * Pythonidae, a family of nonvenomous snakes found in Africa, Asia, and Australia ** ''Python'' (genus), a genus of Pythonidae found in Africa and Asia * Python (mythology), a mythical serpent Computing * Python (prog ...

, R,

Ruby Ruby is a pinkish-red-to-blood-red-colored gemstone, a variety of the mineral corundum ( aluminium oxide). Ruby is one of the most popular traditional jewelry gems and is very durable. Other varieties of gem-quality corundum are called sapph ...

Rust Rust is an iron oxide, a usually reddish-brown oxide formed by the reaction of iron and oxygen in the catalytic presence of water or air moisture. Rust consists of hydrous iron(III) oxides (Fe2O3·nH2O) and iron(III) oxide-hydroxide (FeO(OH) ...

Smalltalk Smalltalk is a purely object oriented programming language (OOP) that was originally created in the 1970s for educational use, specifically for constructionist learning, but later found use in business. It was created at Xerox PARC by Learni ...

, and

OpenCL OpenCL (Open Computing Language) is a software framework, framework for writing programs that execute across heterogeneous computing, heterogeneous platforms consisting of central processing units (CPUs), graphics processing units (GPUs), di ...

. A

command-line interface A command-line interface (CLI) is a means of interacting with software via command (computing), commands each formatted as a line of text. Command-line interfaces emerged in the mid-1960s, on computer terminals, as an interactive and more user ...

program is also available.

References

External links