7z (file format)

7z is a compressed archive file format that supports several different data compression, encryption and pre-processing algorithms. The 7z format initially appeared as implemented by the 7-Zip archiver. The 7-Zip program is publicly available under the terms of the GNU Lesser General Public License. The LZMA SDK 4.62 was placed in the public domain in December 2008. The latest stable version of 7-Zip and LZMA SDK is version 22.01. The official, informal 7z file format specification has been distributed with 7-Zip's source code since 2015. The specification can be found in plain text format in the 'doc' sub-directory of the source code distribution. There have been additional third-party attempts at writing more concrete documentation based on the released code.


Features and enhancements

The 7z format provides the following main features:
* Open, modular architecture that allows any compression, conversion, or encryption method to be stacked.
* High compression ratios (depending on the compression method used).
* 256-bit AES encryption.
* Zip 2.0 (legacy) encryption.
* Large file support (up to approximately 16 exbibytes, or 2^64 bytes).
* Unicode file names.
* Support for solid compression, where multiple files of like type are compressed within a single stream, in order to exploit the combined redundancy inherent in similar files.
* Compression and encryption of archive headers.
* Support for multi-part archives, e.g. xxx.7z.001, xxx.7z.002, ... (see the context menu items ''Split File...'' to create them and ''Combine Files...'' to re-assemble an archive from a set of multi-part component files).
* Support for custom codec plugin DLLs.

The format's open architecture allows additional future compression methods to be added to the standard.
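The benefit of solid compression can be illustrated with a minimal sketch. It uses Python's standard `lzma` module rather than the 7z container itself, so it only approximates what an archiver does; the sample data and the function names `separate_size` and `solid_size` are made up for illustration.

```python
import lzma


def separate_size(files):
    """Total size when every file is compressed as its own stream."""
    return sum(len(lzma.compress(data)) for data in files)


def solid_size(files):
    """Size when all files are concatenated and compressed as one stream."""
    return len(lzma.compress(b"".join(files)))


# Hypothetical sample data: 50 small text files with very similar content.
sample_files = [
    f"id={i}; status=ok; note=the quick brown fox jumps over the lazy dog\n".encode() * 10
    for i in range(50)
]

if __name__ == "__main__":
    print("separate streams:", separate_size(sample_files), "bytes")
    print("one solid stream:", solid_size(sample_files), "bytes")
```

The single stream can reuse matches across file boundaries, which is where the saving comes from. The usual trade-off is that reading one file out of a solid block generally requires decompressing the data stored before it.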


Compression methods

The following compression methods are currently defined:
* LZMA – A variation of the LZ77 algorithm, using a sliding dictionary up to 4 GB in length for duplicate string elimination. The LZ stage is followed by entropy coding using a Markov chain-based range coder and binary trees.
* LZMA2 – A modified version of LZMA providing better multithreading support and less expansion of incompressible data.
* Bzip2 – The standard Burrows–Wheeler transform algorithm. Bzip2 uses two reversible transformations, BWT followed by move-to-front, with Huffman coding for symbol reduction (the actual compression element).
* PPMd – Dmitry Shkarin's 2002 PPMdH (PPMII (Prediction by Partial matching with Information Inheritance) and cPPMII (complicated PPMII)) with small changes. PPMII is an improved version of the 1984 PPM compression algorithm (prediction by partial matching).
* DEFLATE – Standard algorithm based on 32 kB LZ77 and Huffman coding. Deflate is found in several file formats including ZIP, gzip, PNG and PDF. 7-Zip contains a from-scratch DEFLATE encoder that frequently beats the ''de facto'' standard zlib version in compression size, but at the expense of CPU usage.

A suite of recompression tools called AdvanceCOMP contains a copy of the DEFLATE encoder from the 7-Zip implementation; these utilities can often be used to further reduce the size of existing gzip, ZIP, PNG, or MNG files.
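The two reversible transformations named in the Bzip2 entry can be sketched in a few lines of Python. This is a teaching sketch, not bzip2's actual implementation: it sorts whole rotations (which is slow) and marks the end of the input with a sentinel byte, an assumption made here for simplicity. The function names are hypothetical.

```python
def bwt(data, sentinel=b"$"):
    """Naive Burrows-Wheeler transform: sort all rotations, keep the last column.

    Assumes the single-byte `sentinel` does not occur in `data`.
    """
    if sentinel in data:
        raise ValueError("sentinel must not occur in the input")
    s = data + sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return bytes(rotation[-1] for rotation in rotations)


def mtf_encode(data):
    """Move-to-front: emit each byte's current position, then move it to the front."""
    alphabet = list(range(256))
    indices = []
    for byte in data:
        index = alphabet.index(byte)
        indices.append(index)
        alphabet.insert(0, alphabet.pop(index))
    return indices


def mtf_decode(indices):
    """Inverse of mtf_encode."""
    alphabet = list(range(256))
    out = bytearray()
    for index in indices:
        byte = alphabet.pop(index)
        out.append(byte)
        alphabet.insert(0, byte)
    return bytes(out)


if __name__ == "__main__":
    transformed = bwt(b"banana")
    print(transformed)              # the BWT groups equal bytes into runs
    print(mtf_encode(transformed))  # runs of equal bytes become runs of zeros
```

The BWT tends to place identical bytes next to each other, and move-to-front turns such runs into small numbers, mostly zeros, which a Huffman coder can then represent with short codes.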


Pre-processing filters

The LZMA SDK comes with the BCJ and BCJ2 preprocessors included, so that later stages are able to achieve greater compression. For x86, ARM, PowerPC (PPC), IA-64 (Itanium), and ARM Thumb processors, jump targets are 'normalized' before compression by changing relative positions into absolute values. For x86, this means that near jumps, calls and conditional jumps (but not short jumps or short conditional jumps) are converted from the machine language "jump 1655 bytes backwards" style notation to normalized "jump to address 5554" style notation; all jumps to 5554, perhaps a common subroutine, are thus encoded identically, making them more compressible.
* BCJ – Converter for 32-bit x86 executables. Normalises target addresses of near jumps and calls from relative distances to absolute destinations.
* BCJ2 – Pre-processor for 32-bit x86 executables. BCJ2 is an improvement on BCJ, adding additional x86 jump/call instruction processing. Near jump, near call, and conditional near jump targets are split out and compressed separately in another stream.
* Delta encoding – Delta filter, a basic preprocessor for multimedia data.

Similar executable pre-processing technology is included in other software; the RAR compressor features displacement compression for 32-bit x86 executables and IA-64 executables, and the UPX runtime executable file compressor includes support for working with 16-bit values within DOS binary files.
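The relative-to-absolute conversion described for x86 can be sketched as follows. This is a simplified illustration and not the filter shipped in the LZMA SDK: it treats every 0xE8 (CALL rel32) or 0xE9 (JMP rel32) byte as the start of an instruction, whereas a production filter would need extra checks to limit false positives. The names `bcj_encode` and `bcj_decode` are hypothetical.

```python
CALL_NEAR = 0xE8  # x86 CALL rel32
JMP_NEAR = 0xE9   # x86 JMP rel32


def _convert(code, encoding):
    """Rewrite the 32-bit operand after each E8/E9 byte (simplified sketch)."""
    out = bytearray(code)
    i = 0
    while i + 5 <= len(out):
        if out[i] in (CALL_NEAR, JMP_NEAR):
            operand = int.from_bytes(out[i + 1:i + 5], "little")
            next_instruction = i + 5  # rel32 is relative to the following instruction
            if encoding:
                operand = (operand + next_instruction) & 0xFFFFFFFF  # relative -> absolute
            else:
                operand = (operand - next_instruction) & 0xFFFFFFFF  # absolute -> relative
            out[i + 1:i + 5] = operand.to_bytes(4, "little")
            i += 5  # skip the operand so it is never reinterpreted as an opcode
        else:
            i += 1
    return bytes(out)


def bcj_encode(code):
    return _convert(code, encoding=True)


def bcj_decode(code):
    return _convert(code, encoding=False)


if __name__ == "__main__":
    # Two calls at offsets 0 and 8 that both target address 0x1000.
    code = (
        bytes([CALL_NEAR]) + (0x1000 - 5).to_bytes(4, "little")
        + b"\x90" * 3
        + bytes([CALL_NEAR]) + (0x1000 - 13).to_bytes(4, "little")
    )
    encoded = bcj_encode(code)
    print(code.hex())
    print(encoded.hex())
```

Before filtering, the two calls carry different operands even though they reach the same subroutine; after filtering, both operands are the same four bytes, which gives the later LZ stage a repeated string to match. Because opcode bytes are left untouched and operands are skipped, the decoder visits exactly the same positions as the encoder.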


Encryption

The 7z format supports encryption with the AES algorithm with a 256-bit key. The key is generated from a user-supplied passphrase using an algorithm based on the SHA-256 hash function. SHA-256 is executed 2^18 (262,144) times, which causes a significant delay on slow PCs before compression or extraction starts. This technique is called key stretching and is used to make a brute-force search for the passphrase more difficult. Current GPU-based and custom hardware attacks limit the effectiveness of this particular method of key stretching (Colin Percival, "Stronger Key Derivation via Sequential Memory-Hard Functions", presented at BSDCan'09, May 2009), so it is still important to choose a strong password. The 7z format provides the option to encrypt the filenames of a 7z archive.
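Key stretching with repeated SHA-256 can be sketched as below. The text only states that SHA-256 is applied 2^18 times; the remaining details here (feeding salt, UTF-16LE passphrase and an 8-byte counter into one running hash, and the name `stretch_key`) are assumptions for illustration, and the output has not been checked against 7-Zip.

```python
import hashlib


def stretch_key(passphrase, salt=b"", cycles_power=18):
    """Derive a 32-byte key by hashing the passphrase 2**cycles_power times.

    Assumed layout per round: salt, passphrase as UTF-16LE, 8-byte
    little-endian round counter, all fed into a single SHA-256 state.
    """
    password_bytes = passphrase.encode("utf-16-le")
    digest = hashlib.sha256()
    for counter in range(1 << cycles_power):
        digest.update(salt)
        digest.update(password_bytes)
        digest.update(counter.to_bytes(8, "little"))
    return digest.digest()  # 32 bytes, the size of an AES-256 key


if __name__ == "__main__":
    key = stretch_key("correct horse battery staple")
    print(len(key), "byte key:", key.hex())
```

Each guess in a brute-force search has to repeat all of the rounds, so raising `cycles_power` by one doubles the attacker's cost per guess, and the legitimate user's delay along with it. The work is cheap in memory, which is the reason GPUs and custom hardware handle it well.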


Limitations

The 7z format does not store filesystem permissions (such as UNIX owner/group permissions or NTFS ACLs), and hence can be inappropriate for backup/archival purposes. A workaround on UNIX-like systems is to convert the data to a tar bitstream before compressing with 7z. GNU tar (common in many UNIX environments) can also compress with the LZMA2 algorithm ("xz") natively, without the use of 7z, using the "-J" switch. The resulting file extension is ".tar.xz" or ".txz" and not ".tar.7z". This method of compression has been adopted by many distributions for packaging, such as Arch, Debian (deb), Fedora (rpm) and Slackware. (The older "lzma" format is less efficient.) On the other hand, tar does not save the filesystem encoding, which means that file names stored in a tar archive can become unreadable if extracted on a different computer.

The 7z format does not allow extraction of some "broken files": for example, if one has only the first segment of a multi-part 7z archive, 7z cannot extract the start of the files within the archive; it must wait until all segments are downloaded.

The 7z format also lacks recovery records, making it vulnerable to data degradation unless used in conjunction with external solutions, like parchives, or within filesystems with robust error correction. By way of comparison, zip files also lack a recovery feature, while the rar format has one.
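The tar-based workaround for permissions can be sketched with Python's standard `tarfile` module, which can write xz-compressed tar archives directly. The helper names, the file name `script.sh`, and the owner and group names are made up for illustration; the archive is built in memory so the sketch does not touch the filesystem.

```python
import io
import tarfile


def make_tar_xz(files):
    """Pack {name: (data, mode)} into an in-memory .tar.xz archive."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:xz") as tar:
        for name, (data, mode) in files.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            info.mode = mode       # UNIX permission bits travel in the tar header
            info.uname = "alice"   # hypothetical owner name
            info.gname = "staff"   # hypothetical group name
            tar.addfile(info, io.BytesIO(data))
    return buffer.getvalue()


def read_metadata(archive):
    """Return {name: (mode, owner, group)} as stored in the archive."""
    with tarfile.open(fileobj=io.BytesIO(archive), mode="r:xz") as tar:
        return {m.name: (m.mode, m.uname, m.gname) for m in tar.getmembers()}


if __name__ == "__main__":
    archive = make_tar_xz({"script.sh": (b"#!/bin/sh\necho hello\n", 0o750)})
    for name, (mode, owner, group) in read_metadata(archive).items():
        print(name, oct(mode), owner, group)
```

The permission bits and owner names are carried by the tar layer, and the compressor only ever sees an opaque byte stream. The same holds when the tar stream is handed to 7z instead of xz.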


See also

* Comparison of archive formats
* List of archive formats
* Open format

