Btrfs Qgroup Screenshot
   HOME

TheInfoList



OR:

Btrfs (pronounced as "better F S", "butter F S", "b-tree F S", or "B.T.R.F.S.") is a computer storage format that combines a file system based on the
copy-on-write Copy-on-write (COW), also called implicit sharing or shadowing, is a resource-management technique used in programming to manage shared data efficiently. Instead of copying data right away when multiple programs use it, the same data is shared ...
(COW) principle with a
logical volume manager In computer storage, logical volume management or LVM provides a method of allocating space on mass-storage devices that is more flexible than conventional partitioning schemes to store volumes. In particular, a volume manager can concatenate, ...
(distinct from Linux's LVM), developed together. It was created by Chris Mason in 2007 for use in
Linux Linux ( ) is a family of open source Unix-like operating systems based on the Linux kernel, an kernel (operating system), operating system kernel first released on September 17, 1991, by Linus Torvalds. Linux is typically package manager, pac ...
, and since November 2013, the file system's on-disk format has been declared stable in the Linux
kernel Kernel may refer to: Computing * Kernel (operating system), the central component of most operating systems * Kernel (image processing), a matrix used for image convolution * Compute kernel, in GPGPU programming * Kernel method, in machine learnin ...
. Btrfs is intended to address the lack of pooling,
snapshots Snapshot, snapshots or snap shot may refer to: * Snapshot (photography), a photograph taken without preparation Computing * Snapshot (computer storage), the state of a system at a particular point in time * Snapshot (file format) or SNP, a file ...
, integrity checking,
data scrubbing Data scrubbing is an error correction technique that uses a background task to periodically inspect main memory or storage for errors, then corrects detected errors using redundant data in the form of different checksums or copies of data. Data ...
, and integral multi-device spanning in Linux file systems. Mason, the principal Btrfs author, stated that its goal was "to let inuxscale for the storage that will be available. Scaling is not just about addressing the storage but also means being able to administer and to manage it with a clean interface that lets people see what's being used and makes it more reliable".


History

The core data structure of Btrfsthe copy-on-write
B-tree In computer science, a B-tree is a self-balancing tree data structure that maintains sorted data and allows searches, sequential access, insertions, and deletions in logarithmic time. The B-tree generalizes the binary search tree, allowing fo ...
was originally proposed by
IBM International Business Machines Corporation (using the trademark IBM), nicknamed Big Blue, is an American Multinational corporation, multinational technology company headquartered in Armonk, New York, and present in over 175 countries. It is ...
researcher Ohad Rodeh at a
USENIX USENIX is an American 501(c)(3) nonprofit membership organization based in Berkeley, California and founded in 1975 that supports advanced computing systems, operating system (OS), and computer networking research. It organizes several confe ...
conference in 2007. Mason, an engineer working on
ReiserFS ReiserFS is a general-purpose, journaling file system initially designed and implemented by a team at Namesys led by Hans Reiser and licensed under GPLv2. Introduced in version 2.4.1 of the Linux kernel, it was the first journaling file syst ...
for SUSE at the time, joined Oracle later that year and began work on a new file system based on these B-trees. In 2008, the principal developer of the
ext3 ext3, or third extended filesystem, is a journaling file system, journaled file system that is commonly used with the Linux kernel. It used to be the default file system for many popular Linux distributions but generally has been supplanted by ...
and
ext4 ext4 (fourth extended filesystem) is a journaling file system for Linux, developed as the successor to ext3. ext4 was initially a series of backward-compatible extensions to ext3, many of them originally developed by Cluster File Systems for ...
file systems,
Theodore Ts'o Theodore Yue Tak Ts'o (; born 1968) is an American software engineer mainly known for his contributions to the Linux kernel, in particular his contributions to file systems. He is the secondary developer and maintainer of e2fsprogs, the usersp ...
, stated that although ext4 has improved features, it is not a major advance; it uses old technology and is a stop-gap. Ts'o said that Btrfs is the better direction because "it offers improvements in scalability, reliability, and ease of management". Btrfs also has "a number of the same design ideas that reiser3/ 4 had". Btrfs 1.0, with finalized on-disk format, was originally slated for a late-2008 release, and was finally accepted into the
Linux kernel mainline The Linux kernel is a Free and open-source software, free and open source Unix-like kernel (operating system), kernel that is used in many computer systems worldwide. The kernel was created by Linus Torvalds in 1991 and was soon adopted as the k ...
in 2009. Several
Linux distribution A Linux distribution, often abbreviated as distro, is an operating system that includes the Linux kernel for its kernel functionality. Although the name does not imply product distribution per se, a distro—if distributed on its own—is oft ...
s began offering Btrfs as an experimental choice of root file system during installation. In July 2011, Btrfs automatic
defragmentation In the maintenance of file systems, defragmentation is a process that reduces the degree of fragmentation. It does this by physically organizing the contents of the mass storage device used to store files into the smallest number of contiguous ...
and scrubbing features were merged into version 3.0 of the
Linux kernel mainline The Linux kernel is a Free and open-source software, free and open source Unix-like kernel (operating system), kernel that is used in many computer systems worldwide. The kernel was created by Linus Torvalds in 1991 and was soon adopted as the k ...
. Besides Mason at Oracle, Miao Xie at Fujitsu contributed performance improvements. In June 2012, Mason left Oracle for
Fusion-io Fusion-io, Inc. was a computer hardware and software systems company (acquired by SanDisk Corporation in 2014) based in Cottonwood Heights, Utah, that designed and manufactured products using flash memory technology. The Fusion was marketed ...
, which he left a year later with Josef Bacik to join
Facebook Facebook is a social media and social networking service owned by the American technology conglomerate Meta Platforms, Meta. Created in 2004 by Mark Zuckerberg with four other Harvard College students and roommates, Eduardo Saverin, Andre ...
. While at both companies, Mason continued his work on Btrfs. In 2012, two Linux distributions moved Btrfs from experimental to production or supported status:
Oracle Linux Oracle Linux (abbreviated OL, formerly known as Oracle Enterprise Linux or OEL) is a Linux distribution packaged and freely distributed by Oracle, available partially under the GNU General Public License since late 2006. It is, in part, compile ...
in March, followed by
SUSE Linux Enterprise SUSE Linux Enterprise (SLE) is a Linux-based operating system developed by SUSE. It is available in two editions, suffixed with Server (SLES) for servers and mainframes, and Desktop (SLED) for workstations and desktop computers. Its major ve ...
in August. In 2015, Btrfs was adopted as the default filesystem for
SUSE Linux Enterprise Server SUSE Linux Enterprise (SLE) is a Linux-based operating system developed by SUSE. It is available in two editions, suffixed with Server (SLES) for servers and mainframes, and Desktop (SLED) for workstations and desktop computers. Its major ve ...
(SLE) 12. In August 2017, Red Hat announced in the release notes for
Red Hat Enterprise Linux Red Hat Enterprise Linux (RHEL) is a commercial Linux distribution developed by Red Hat. Red Hat Enterprise Linux is released in server versions for x86-64, Power ISA, ARM64, and IBM Z and a desktop version for x86-64. Fedora Linux and ...
(RHEL) 7.4 that it no longer planned to move Btrfs to a fully supported feature (it's been included as a "technology preview" since RHEL 6 beta) noting that it would remain available in the RHEL 7 release series. Btrfs was removed from RHEL 8 in May 2019. RHEL moved from ext4 in RHEL 6 to
XFS XFS is a high-performance 64-bit journaling file system created by Silicon Graphics, Inc (SGI) in 1993. It was the default file system in SGI's IRIX operating system starting with its version 5.3. XFS was ported to the Linux kernel in 2001; a ...
in RHEL 7. In 2020, Btrfs was selected as the default file system for
Fedora A fedora () is a hat with a soft brim and indented crown.Kilgour, Ruth Edwards (1958). ''A Pageant of Hats Ancient and Modern''. R. M. McBride Company. It is typically creased lengthwise down the crown and "pinched" near the front on both sides ...
33 for desktop variants.


Features


List of features


Implemented

As of version 6.0 of the Linux kernel, Btrfs implements the following features: * Mostly self-healing in some configurations due to the nature of copy-on-write * Online defragmentation and an ''autodefrag'' mount option * Online volume growth and shrinking * Online
block device In Unix-like operating systems, a device file, device node, or special file is an interface to a device driver that appears in a file system as if it were an ordinary file. There are also special files in DOS, OS/2, and Windows. These spec ...
addition and removal * Online balancing (movement of objects between block devices to balance load) * Offline filesystem check * Online
data scrubbing Data scrubbing is an error correction technique that uses a background task to periodically inspect main memory or storage for errors, then corrects detected errors using redundant data in the form of different checksums or copies of data. Data ...
for finding errors and automatically fixing them for files with redundant copies *
RAID 0 In computer storage, the standard RAID levels comprise a basic set of RAID ("redundant array of independent disks" or "redundant array of inexpensive disks") configurations that employ the techniques of striping, mirroring, or parity to create la ...
,
RAID 1 In computer storage, the standard RAID levels comprise a basic set of RAID ("redundant array of independent disks" or "redundant array of inexpensive disks") configurations that employ the techniques of striping, mirroring, or parity to create la ...
, and RAID 10 * Subvolumes (one or more separately mountable filesystem roots within each
disk partition Disk partitioning or disk slicing is the creation of one or more regions on Computer data storage#Secondary storage, secondary storage, so that each region can be managed separately. These regions are called partitions. It is typically the first ...
) * Transparent
compression Compression may refer to: Physical science *Compression (physics), size reduction due to forces *Compression member, a structural element such as a column *Compressibility, susceptibility to compression * Gas compression *Compression ratio, of a ...
via
zlib zlib ( or "zeta-lib", ) is a software library used for data compression as well as a data format. zlib was written by Jean-loup Gailly and Mark Adler and is an abstraction of the DEFLATE compression algorithm used in their gzip file compre ...
, LZO and (since 4.14)
ZSTD Zstandard is a lossless data compression algorithm developed by Yann Collet at Facebook. Zstd is the corresponding reference implementation in C, released as open-source software on 31 August 2016. The algorithm was published in 2018 as , wh ...
, configurable per file or volume * Atomic writable (via copy-on-write) or read-only
snapshots Snapshot, snapshots or snap shot may refer to: * Snapshot (photography), a photograph taken without preparation Computing * Snapshot (computer storage), the state of a system at a particular point in time * Snapshot (file format) or SNP, a file ...
of subvolumes * File cloning ( reflink, copy-on-write) via cp --reflink <source file> <destination file> *
Checksum A checksum is a small-sized block of data derived from another block of digital data for the purpose of detecting errors that may have been introduced during its transmission or storage. By themselves, checksums are often used to verify dat ...
s on data and metadata (
CRC-32C A cyclic redundancy check (CRC) is an error-detecting code commonly used in digital networks and storage devices to detect accidental changes to digital data. Blocks of data entering these systems get a short ''check value'' attached, based on th ...
). New hash functions are implemented since 5.5:
xxHash This is a list of hash functions, including cyclic redundancy checks, checksum functions, and cryptographic hash functions. Cyclic redundancy checks Adler-32 is often mistaken for a CRC, but it is not: it is a checksum A checksum is a sm ...
,
SHA256 SHA-2 (Secure Hash Algorithm 2) is a set of cryptographic hash functions designed by the United States National Security Agency (NSA) and first published in 2001. They are built using the Merkle–Damgård construction, from a one-way compression ...
,
BLAKE2B BLAKE is a cryptographic hash function based on Daniel J. Bernstein's ChaCha (cipher), ChaCha stream cipher, but a permuted copy of the input block, XORed with round constants, is added before each ChaCha round. Like SHA-2, there are two variants ...
. * In-place conversion from ext3/4 to Btrfs (with rollback). This feature regressed around btrfs-progs version 4.0, rewritten from scratch in 4.6. *
Union mount In computer operating systems, union mounting is a way of combining multiple directories into one that appears to contain their combined contents. Union mounting is supported in Linux, Berkeley Software Distribution, BSD and several of its success ...
ing of read-only storage, known as file system seeding (read-only storage used as a copy-on-write backing for a writable Btrfs) * Block discard (reclaims space on some
virtualized In computing, virtualization (abbreviated v12n) is a series of technologies that allows dividing of physical computing resources into a series of virtual machines, operating systems, processes or containers. Virtualization began in the 1960s with ...
setups and improves
wear leveling Wear leveling (also written as wear levelling) is a technique Wear leveling techniques for flash memory systems. for prolonging the service life of some kinds of erasable computer storage media, such as flash memory, which is used in solid-state d ...
on
SSDs The Ship Self-Defense System (SSDS) is a combat system specifically designed for anti-air defense of aircraft carriers, and most other non-Aegis United States Navy combat ships. It coordinates several existing shipboard systems. Multi-sensor i ...
with
TRIM Trim or TRIM may refer to: Cutting * Cutting or trimming small pieces off something to remove them ** Book trimming, a stage of the publishing process ** Pruning, trimming as a form of pruning often used on trees Decoration * Trim (sewing), or ...
) * Send/receive (saving
diff In computing, the utility diff is a data comparison tool that computes and displays the differences between the contents of files. Unlike edit distance notions used for other purposes, diff is line-oriented rather than character-oriented, but i ...
s between snapshots to a binary stream) *
Incremental backup An incremental backup is one in which successive copies of the data contain only the portion that has changed since the preceding backup copy was made. When a full recovery is needed, the restoration process would need the last full backup plus al ...
* Out-of-band
data deduplication In computing, data deduplication is a technique for eliminating duplicate copies of repeating data. Successful implementation of the technique can improve storage utilization, which may in turn lower capital expenditure by reducing the overall amou ...
(requires userspace tools) * Ability to handle
swap file In computer operating systems, memory paging is a memory management scheme that allows the physical memory used by a program to be non-contiguous. This also helps avoid the problem of memory fragmentation and requiring compaction to reduce fra ...
s and swap partitions


Implemented but not recommended for production use

* Hierarchical per-subvolume quotas * RAID 5,
RAID 6 In computer storage, the standard RAID levels comprise a basic set of RAID ("redundant array of independent disks" or "redundant array of inexpensive disks") configurations that employ the techniques of striping, mirroring, or parity to create la ...
(fail to guard against
write hole RAID (; redundant array of inexpensive disks or redundant array of independent disks) is a data storage virtualization technology that combines multiple physical data storage components into one or more logical units for the purposes of data redu ...
s)


Cloning

Btrfs provides a ''clone'' operation that atomically creates a copy-on-write snapshot of a file. Such cloned files are sometimes referred to as '' reflinks'', in light of the proposed associated Linux kernel
system call In computing, a system call (syscall) is the programmatic way in which a computer program requests a service from the operating system on which it is executed. This may include hardware-related services (for example, accessing a hard disk drive ...
. By cloning, the file system does not create a new link pointing to an existing
inode An inode (index node) is a data structure in a Unix-style file system that describes a file-system object such as a file or a directory. Each inode stores the attributes and disk block locations of the object's data. File-system object attribu ...
; instead, it creates a new inode that initially shares the same disk blocks with the original file. As a result, cloning works only within the boundaries of the same Btrfs file system, but since version 3.6 of the Linux kernel it may cross the boundaries of subvolumes under certain circumstances. The actual data blocks are not duplicated; at the same time, due to the copy-on-write (CoW) nature of Btrfs, modifications to any of the cloned files are not visible in the original file and vice versa. Cloning should not be confused with
hard links Hard means something that is difficult to do. It may also refer to: * Hardness, resistance of physical materials to deformation or fracture * Hard water, water with high mineral content Arts and entertainment * Hard (TV series), ''Hard'' (TV ser ...
, which are directory entries that associate multiple file names with a single file. While hard links can be taken as different names for the same file, cloning in Btrfs provides independent files that initially share all their disk blocks. Support for this Btrfs feature was added in version 7.5 of the
GNU coreutils The GNU Core Utilities or coreutils is a collection of GNU software that implements many standard, Unix-based shell commands. The utilities generally provide POSIX compliant interface when the environment variable is set, but otherwise offers a ...
, via the --reflink option to the cp command. In addition to data cloning (), Btrfs also supports out-of-band deduplication via . This functionality allows two files with (even partially) identical data to share storage.


Subvolumes and snapshots

A Btrfs subvolume can be thought of as a separate POSIX file
namespace In computing, a namespace is a set of signs (''names'') that are used to identify and refer to objects of various kinds. A namespace ensures that all of a given set of objects have unique names so that they can be easily identified. Namespaces ...
, mountable separately by passing subvol or subvolid options to the utility. It can also be accessed by mounting the top-level subvolume, in which case subvolumes are visible and accessible as its subdirectories. Subvolumes can be created at any place within the file system hierarchy, and they can also be nested. Nested subvolumes appear as subdirectories within their parent subvolumes, similarly to the way a top-level subvolume presents its subvolumes as subdirectories. Deleting a subvolume is not possible until all subvolumes below it in the nesting hierarchy are deleted; as a result, top-level subvolumes cannot be deleted. Any Btrfs file system always has a default subvolume, which is initially set to be the top-level subvolume, and is mounted by default if no subvolume selection option is passed to mount. The default subvolume can be changed as required. A Btrfs
snapshot Snapshot, snapshots or snap shot may refer to: * Snapshot (photography), a photograph taken without preparation Computing * Snapshot (computer storage), the state of a system at a particular point in time * Snapshot (file format) or SNP, a file ...
is a subvolume that shares its data (and metadata) with some other subvolume, using Btrfs' copy-on-write capabilities, and modifications to a snapshot are not visible in the original subvolume. Once a writable snapshot is made, it can be treated as an alternate version of the original file system. For example, to roll back to a snapshot, a modified original subvolume needs to be unmounted and the snapshot needs to be mounted in its place. At that point, the original subvolume may also be deleted. The copy-on-write (CoW) nature of Btrfs means that snapshots are quickly created, while initially consuming very little disk space. Since a snapshot is a subvolume, creating nested snapshots is also possible. Taking snapshots of a subvolume is not a recursive process; thus, if a snapshot of a subvolume is created, every subvolume or snapshot that the subvolume already contains is mapped to an empty directory of the same name inside the snapshot. Taking snapshots of a directory is not possible, as only subvolumes can have snapshots. However, there is a workaround that involves reflinks spread across subvolumes: a new subvolume is created, containing cross-subvolume reflinks to the content of the targeted directory. Having that available, a snapshot of this new volume can be created. A subvolume in Btrfs is quite different from a traditional
Logical Volume Manager In computer storage, logical volume management or LVM provides a method of allocating space on mass-storage devices that is more flexible than conventional partitioning schemes to store volumes. In particular, a volume manager can concatenate, ...
(LVM) logical volume. With LVM, a logical volume is a separate
block device In Unix-like operating systems, a device file, device node, or special file is an interface to a device driver that appears in a file system as if it were an ordinary file. There are also special files in DOS, OS/2, and Windows. These spec ...
, while a Btrfs subvolume is not and it cannot be treated or used that way. Making dd or LVM snapshots of btrfs leads to data loss if either the original or the copy is mounted while both are on the same computer.


Send–receive

Given any pair of subvolumes (or snapshots), Btrfs can generate a binary
diff In computing, the utility diff is a data comparison tool that computes and displays the differences between the contents of files. Unlike edit distance notions used for other purposes, diff is line-oriented rather than character-oriented, but i ...
between them (by using the btrfs send command) that can be replayed later (by using btrfs receive), possibly on a different Btrfs file system. The send–receive feature effectively creates (and applies) a set of data modifications required for converting one subvolume into another. The send/receive feature can be used with regularly scheduled snapshots for implementing a simple form of file system replication, or for the purpose of performing
incremental backup An incremental backup is one in which successive copies of the data contain only the portion that has changed since the preceding backup copy was made. When a full recovery is needed, the restoration process would need the last full backup plus al ...
s.


Quota groups

A ''quota group'' (or ''qgroup'') imposes an upper limit to the space a subvolume or snapshot may consume. A new snapshot initially consumes no quota because its data is shared with its parent, but thereafter incurs a charge for new files and copy-on-write operations on existing files. When quotas are active, a quota group is automatically created with each new subvolume or snapshot. These initial quota groups are building blocks which can be grouped (with the btrfs qgroup command) into hierarchies to implement quota pools. Quota groups only apply to subvolumes and snapshots, while having quotas enforced on individual subdirectories, users, or user groups is not possible. However, workarounds are possible by using different subvolumes for all users or user groups that require a quota to be enforced.


In-place conversion from ext2/3/4 and ReiserFS

As the result of having very little metadata anchored in fixed locations, Btrfs can warp to fit unusual spatial layouts of the backend storage devices. The btrfs-convert tool exploits this ability to do an in-place conversion of an ext2/3/4 or
ReiserFS ReiserFS is a general-purpose, journaling file system initially designed and implemented by a team at Namesys led by Hans Reiser and licensed under GPLv2. Introduced in version 2.4.1 of the Linux kernel, it was the first journaling file syst ...
file system, by nesting the equivalent Btrfs metadata in its unallocated space—while preserving an unmodified copy of the original file system. The conversion involves creating a copy of the whole ext2/3/4 metadata, while the Btrfs files simply point to the same blocks used by the ext2/3/4 files. This makes the bulk of the blocks shared between the two filesystems before the conversion becomes permanent. Thanks to the copy-on-write nature of Btrfs, the original versions of the file data blocks are preserved during all file modifications. Until the conversion becomes permanent, only the blocks that were marked as free in ext2/3/4 are used to hold new Btrfs modifications, meaning that the conversion can be undone at any time (although doing so will erase any changes made after the conversion to Btrfs). All converted files are available and writable in the default subvolume of the Btrfs. A sparse file holding all of the references to the original ext2/3/4 filesystem is created in a separate subvolume, which is mountable on its own as a read-only disk image, allowing both original and converted file systems to be accessed at the same time. Deleting this sparse file frees up the space and makes the conversion permanent. In 4.x versions of the mainline Linux kernel, the in-place ext3/4 conversion was considered untested and rarely used. However, the feature was rewritten from scratch in 2016 for btrfs-progs 4.6. and has been considered stable since then. In-place conversion from ReiserFS was introduced in September 2017 with kernel 4.13.


Union mounting / seed devices

When creating a new Btrfs, an existing Btrfs can be used as a read-only "seed" file system. The new file system will then act as a copy-on-write overlay on the seed, as a form of
union mount In computer operating systems, union mounting is a way of combining multiple directories into one that appears to contain their combined contents. Union mounting is supported in Linux, Berkeley Software Distribution, BSD and several of its success ...
ing. The seed can be later detached from the Btrfs, at which point the rebalancer will simply copy over any seed data still referenced by the new file system before detaching. Mason has suggested this may be useful for a
Live CD A live CD (also live DVD, live disc, or live operating system) is a complete booting, bootable computer installation including operating system which runs directly from a CD-ROM or similar storage device into a computer's memory, rather than lo ...
installer, which might boot from a read-only Btrfs seed on an optical disc, rebalance itself to the target partition on the install disk in the background while the user continues to work, then eject the disc to complete the installation without rebooting.


Encryption

In his 2009 interview, Mason stated that support for encryption was planned for Btrfs. In the meantime, a workaround for combining encryption with Btrfs is to use a full-disk encryption mechanism such as
dm-crypt dm-crypt is a transparent block device encryption subsystem in Linux kernel versions 2.6 and later and in DragonFly BSD. It is part of the device mapper (dm) infrastructure, and uses cryptographic routines from the kernel's Crypto API. Unlike ...
 /
LUKS The Linux Unified Key Setup (LUKS) is a disk encryption specification created by Clemens Fruhwirth in 2004 and originally intended for Linux. LUKS implements a platform-independent standard on-disk format for use in various tools. This facilitate ...
on the underlying devices and to create the Btrfs filesystem on top of that layer. the developers were working to add keyed hash like
HMAC In cryptography, an HMAC (sometimes expanded as either keyed-hash message authentication code or hash-based message authentication code) is a specific type of message authentication code (MAC) involving a cryptographic hash function and a se ...
(
SHA256 SHA-2 (Secure Hash Algorithm 2) is a set of cryptographic hash functions designed by the United States National Security Agency (NSA) and first published in 2001. They are built using the Merkle–Damgård construction, from a one-way compression ...
).


Checking and recovery

Unix systems traditionally rely on "
fsck The system utility fsck (''file system check'') is a tool for checking the consistency of a file system in Unix and Unix-like operating systems, such as Linux Linux ( ) is a family of open source Unix-like operating systems based on the L ...
" programs to check and repair filesystems. This functionality is implemented via the btrfs check program. Since version 4.0 this functionality is deemed relatively stable. However, as of December 2022, the btrfs documentation suggests that its --repair option be used only if you have been advised by "a developer or an experienced user". As of August 2022, the SLE documentation recommends using a Live CD, performing a backup and only using the repair option as a last resort. There is another tool, named btrfs-restore, that can be used to recover files from an unmountable filesystem, without modifying the broken filesystem itself (i.e., non-destructively). In normal use, Btrfs is mostly self-healing and can recover from broken root trees at mount time, thanks to making periodic data flushes to permanent storage, by default every 30 seconds. Thus, isolated errors will cause a maximum of 30 seconds of filesystem changes to be lost at the next mount. This period can be changed by specifying a desired value (in seconds) with the commit mount option.


Design

Ohad Rodeh's original proposal at USENIX 2007 noted that
B+ tree A B+ tree is an m-ary tree with a variable but often large number of children per node. A B+ tree consists of a root, internal nodes and leaves. The root may be either a leaf or a node with two or more children. A B+ tree can be viewed as a B ...
s, which are widely used as on-disk data structures for databases, could not efficiently allow copy-on-write-based snapshots because its leaf nodes were linked together: if a leaf was copied on write, its siblings and parents would have to be as well, as would ''their'' siblings and parents and so on until the entire tree was copied. He suggested instead a modified
B-tree In computer science, a B-tree is a self-balancing tree data structure that maintains sorted data and allows searches, sequential access, insertions, and deletions in logarithmic time. The B-tree generalizes the binary search tree, allowing fo ...
(which has no leaf linkage), with a refcount associated to each tree node but stored in an ad hoc free map structure and certain relaxations to the tree's balancing algorithms to make them copy-on-write friendly. The result would be a data structure suitable for a high-performance object store that could perform copy-on-write snapshots, while maintaining good
concurrency Concurrent means happening at the same time. Concurrency, concurrent, or concurrence may refer to: Law * Concurrence, in jurisprudence, the need to prove both ''actus reus'' and ''mens rea'' * Concurring opinion (also called a "concurrence"), a ...
. At Oracle later that year, Mason began work on a snapshot-capable file system that would use this data structure almost exclusively—not just for metadata and file data, but also recursively to track space allocation of the trees themselves. This allowed all traversal and modifications to be funneled through a single code path, against which features such as copy on write, checksumming and mirroring needed to be implemented only once to benefit the entire file system. Btrfs is structured as several layers of such trees, all using the same B-tree implementation. The trees store generic ''items'' sorted by a 136-bit key. The most significant 64 bits of the key are a unique ''object id''. The middle eight bits are an item type field: its use is hardwired into code as an item filter in tree lookups. ''Objects'' can have multiple items of multiple types. The remaining (least significant) 64 bits are used in type-specific ways. Therefore, items for the same object end up adjacent to each other in the tree, grouped by type. By choosing certain key values, objects can further put items of the same type in a particular order. Interior tree nodes are simply flat lists of key-pointer pairs, where the pointer is the logical block number of a child node. Leaf nodes contain item keys packed into the front of the node and item data packed into the end, with the two growing toward each other as the leaf fills up.


File system tree

Within each directory, directory entries appear as ''directory items'', whose least significant bits of key values are a
CRC32C A cyclic redundancy check (CRC) is an error-detecting code commonly used in digital networks and storage devices to detect accidental changes to digital data. Blocks of data entering these systems get a short ''check value'' attached, based on th ...
hash of their filename. Their data is a ''location key'', or the key of the
inode An inode (index node) is a data structure in a Unix-style file system that describes a file-system object such as a file or a directory. Each inode stores the attributes and disk block locations of the object's data. File-system object attribu ...
item it points to. Directory items together can thus act as an index for path-to-inode lookups, but are not used for iteration because they are sorted by their hash, effectively randomly permuting them. This means user applications iterating over and opening files in a large directory would thus generate many more disk seeks between non-adjacent files—a notable performance drain in other file systems with hash-ordered directories such as
ReiserFS ReiserFS is a general-purpose, journaling file system initially designed and implemented by a team at Namesys led by Hans Reiser and licensed under GPLv2. Introduced in version 2.4.1 of the Linux kernel, it was the first journaling file syst ...
, ext3 (with Htree-indexes enabled) and ext4, all of which have
TEA Tea is an aromatic beverage prepared by pouring hot or boiling water over cured or fresh leaves of '' Camellia sinensis'', an evergreen shrub native to East Asia which probably originated in the borderlands of south-western China and nor ...
-hashed filenames. To avoid this, each directory entry has a ''directory index item'', whose key value of the item is set to a per-directory counter that increments with each new directory entry. Iteration over these index items thus returns entries in roughly the same order as stored on disk. Files with hard links in multiple directories have multiple reference items, one for each parent directory. Files with multiple hard links in the ''same'' directory pack all of the links' filenames into the same reference item. This was a design flaw that limited the number of same-directory hard links to however many could fit in a single tree block. (On the default block size of 4 KiB, an average filename length of 8 bytes and a per-filename header of 4 bytes, this would be less than 350.) Applications which made heavy use of multiple same-directory hard links, such as
git Git () is a distributed version control system that tracks versions of files. It is often used to control source code by programmers who are developing software collaboratively. Design goals of Git include speed, data integrity, and suppor ...
,
GNUS Gnus (), or Gnus Network User Services, is a message reader which is part of GNU Emacs. It supports reading and composing both e-mail and news and can also act as an RSS reader, web processor, and directory browser for both local and remote file ...
, GMame and BackupPC were observed to fail at this limit. The limit was eventually removed (and as of October 2012 has been merged pending release in Linux 3.7) by introducing spillover ''extended reference items'' to hold hard link filenames which do not otherwise fit.


Extents

File data is kept outside the tree in ''extents'', which are contiguous runs of disk data blocks. Extent blocks default to 4 KiB in size, do not have headers and contain only (possibly compressed) file data. In compressed extents, individual blocks are not compressed separately; rather, the compression stream spans the entire extent. Files have ''extent data items'' to track the extents which hold their contents. The item's key value is the starting byte offset of the extent. This makes for efficient seeks in large files with many extents, because the correct extent for any given file offset can be computed with just one tree lookup. Snapshots and cloned files share extents. When a small part of a large such extent is overwritten, the resulting copy-on-write may create three new extents: a small one containing the overwritten data, and two large ones with unmodified data on either side of the overwrite. To avoid having to re-write unmodified data, the copy-on-write may instead create ''bookend extents'', or extents which are simply slices of existing extents. Extent data items allow for this by including an offset into the extent they are tracking: items for bookends are those with non-zero offsets.


Extent allocation tree

The ''extent allocation tree'' acts as an allocation map for the file system. Unlike other trees, items in this tree do not have object ids. They represent regions of space: their key values hold the starting offsets and lengths of the regions they represent. The file system divides its allocated space into ''block groups'' which are variable-sized allocation regions that alternate between preferring metadata extents (tree nodes) and data extents (file contents). The default ratio of data to metadata block groups is 1:2. They are intended to use concepts of the
Orlov block allocator The Orlov block allocator is an algorithm to define where a particular file will reside on a given file system (blockwise), so as to speed up disk operations. Etymology The scheme is named after its creator Grigoriy Orlov, who first posted, in ...
to allocate related files together and resist fragmentation by leaving free space between groups. (Ext3 block groups, however, have fixed locations computed from the size of the file system, whereas those in Btrfs are dynamic and created as needed.) Each block group is associated with a ''block group item''. Inode items in the file system tree include a reference to their current block group. ''Extent items'' contain a back-reference to the tree node or file occupying that extent. There may be multiple back-references if the extent is shared between snapshots. If there are too many back-references to fit in the item, they spill out into individual ''extent data reference items''. Tree nodes, in turn, have back-references to their containing trees. This makes it possible to find which extents or tree nodes are in any region of space by doing a B-tree range lookup on a pair of offsets bracketing that region, then following the back-references. For relocating data, this allows an efficient upwards traversal from the relocated blocks to quickly find and fix all downwards references to those blocks, without having to scan the entire file system. This, in turn, allows the file system to efficiently shrink, migrate, and defragment its storage online. The extent allocation tree, as with all other trees in the file system, is copy-on-write. Writes to the file system may thus cause a cascade whereby changed tree nodes and file data result in new extents being allocated, causing the extent tree itself to change. To avoid creating a
feedback loop Feedback occurs when outputs of a system are routed back as inputs as part of a chain of cause and effect that forms a circuit or loop. The system can then be said to ''feed back'' into itself. The notion of cause-and-effect has to be handle ...
, extent tree nodes which are still in memory but not yet committed to disk may be updated in place to reflect new copied-on-write extents. In theory, the extent allocation tree makes a conventional
free-space bitmap Free-space bitmaps are one method used to track allocated sectors by some file systems. While the most simplistic design is highly inefficient, advanced or hybrid implementations of free-space bitmaps are used by some modern file systems. Examp ...
unnecessary because the extent allocation tree acts as a B-tree version of a
BSP tree In computer science, binary space partitioning (BSP) is a method for space partitioning which recursively subdivides a Euclidean space into two convex sets by using hyperplanes as partitions. This process of subdividing gives rise to a represent ...
. In practice, however, an in-memory
red–black tree In computer science, a red–black tree is a self-balancing binary search tree data structure noted for fast storage and retrieval of ordered information. The nodes in a red-black tree hold an extra "color" bit, often drawn as red and black, wh ...
of
page Page most commonly refers to: * Page (paper), one side of a leaf of paper, as in a book Page, PAGE, pages, or paging may also refer to: Roles * Page (assistance occupation), a professional occupation * Page (servant), traditionally a young m ...
-sized bitmaps is used to speed up allocations. These bitmaps are persisted to disk (starting in Linux 2.6.37, via the space_cache mount option) as special extents that are exempt from checksumming and copy-on-write.


Checksum tree and scrubbing

CRC-32C A cyclic redundancy check (CRC) is an error-detecting code commonly used in digital networks and storage devices to detect accidental changes to digital data. Blocks of data entering these systems get a short ''check value'' attached, based on th ...
checksums are computed for both data and metadata and stored as ''checksum items'' in a ''checksum tree''. There is room for 256 bits of metadata checksums and up to a full node (roughly 4 KB or more) for data checksums. Btrfs has provisions for additional checksum algorithms to be added in future versions of the file system. There is one checksum item per contiguous run of allocated blocks, with per-block checksums packed end-to-end into the item data. If there are more checksums than can fit, they spill into another checksum item in a new leaf. If the file system detects a checksum mismatch while reading a block, it first tries to obtain (or create) a good copy of this block from another device if internal mirroring or RAID techniques are in use. Btrfs can initiate an online check of the entire file system by triggering a file system scrub job that is performed in the background. The scrub job scans the entire file system for integrity and automatically attempts to report and repair any bad blocks it finds along the way.


Log tree

An
fsync sync is a standard system call in the Unix operating system, which commits all data from the kernel filesystem buffers to non-volatile storage, i.e., data which has been scheduled for writing via low-level I/O system calls. Higher-level I/O la ...
request commits modified data immediately to stable storage. fsync-heavy workloads (like a
database In computing, a database is an organized collection of data or a type of data store based on the use of a database management system (DBMS), the software that interacts with end users, applications, and the database itself to capture and a ...
or a
virtual machine In computing, a virtual machine (VM) is the virtualization or emulator, emulation of a computer system. Virtual machines are based on computer architectures and provide the functionality of a physical computer. Their implementations may involve ...
whose running OS ''fsyncs'' frequently) could potentially generate a great deal of redundant write I/O by forcing the file system to repeatedly copy-on-write and flush frequently modified parts of trees to storage. To avoid this, a temporary per-subvolume ''log tree'' is created to
journal A journal, from the Old French ''journal'' (meaning "daily"), may refer to: *Bullet journal, a method of personal organization *Diary, a record of personal secretive thoughts and as open book to personal therapy or used to feel connected to onesel ...
fsync-triggered copies on write. Log trees are self-contained, tracking their own extents and keeping their own checksum items. Their items are replayed and deleted at the next full tree commit or (if there was a system crash) at the next remount.


Chunk and device trees

Block device In Unix-like operating systems, a device file, device node, or special file is an interface to a device driver that appears in a file system as if it were an ordinary file. There are also special files in DOS, OS/2, and Windows. These spec ...
s are divided into ''physical chunks'' of 1 GiB for data and 256 MiB for metadata. Physical chunks across multiple devices can be mirrored or striped together into a single ''logical chunk''. These logical chunks are combined into a single logical address space that the rest of the filesystem uses. The ''chunk tree'' tracks this by storing each device therein as a ''device item'' and logical chunks as ''chunk map items'', which provide a forward mapping from logical to physical addresses by storing their offsets in the least significant 64 bits of their key. Chunk map items can be one of several different types: ; single : 1 logical to 1 physical chunk ; dup : 1 logical chunk to 2 physical chunks on 1 block device ; raid0 : N logical chunks to N≥2 physical chunks across N≥2 block devices ; raid1 : 1 logical chunk to 2 physical chunks across 2 out of N≥2 block devices, in contrast to conventional
RAID 1 In computer storage, the standard RAID levels comprise a basic set of RAID ("redundant array of independent disks" or "redundant array of inexpensive disks") configurations that employ the techniques of striping, mirroring, or parity to create la ...
which has N physical chunks ; raid1c3 : 1 logical chunk to 3 physical chunks out of N≥3 block devices ; raid1c4 : 1 logical chunk to 4 physical chunks out of N≥4 block devices ; raid5 : N (for N≥2) logical chunks to N+1 physical chunks across N+1 block devices, with 1 physical chunk used as parity ; raid6 : N (for N≥2) logical chunks to N+2 physical chunks across N+2 block devices, with 2 physical chunks used as parity ''N'' is the number of block devices still having free space when the chunk is allocated. If N is not large enough for the chosen mirroring/mapping, then the filesystem is effectively out of space.


Relocation trees

Defragmentation, shrinking, and rebalancing operations require extents to be relocated. However, doing a simple copy-on-write of the relocating extent will break sharing between snapshots and consume disk space. To preserve sharing, an update-and-swap algorithm is used, with a special ''relocation tree'' serving as scratch space for affected metadata. The extent to be relocated is first copied to its destination. Then, by following backreferences upward through the affected subvolume's file system tree, metadata pointing to the old extent is progressively updated to point at the new one; any newly updated items are stored in the relocation tree. Once the update is complete, items in the relocation tree are swapped with their counterparts in the affected subvolume, and the relocation tree is discarded.


Superblock

All the file system's trees—including the chunk tree itself—are stored in chunks, creating a potential
bootstrapping In general, bootstrapping usually refers to a self-starting process that is supposed to continue or grow without external input. Many analytical techniques are often called bootstrap methods in reference to their self-starting or self-supporting ...
problem when mounting the file system. To bootstrap into a mount, a list of physical addresses of chunks belonging to the chunk and root trees are stored in the '' superblock''. ''Superblock mirrors'' are kept at fixed locations: 64 KiB into every block device, with additional copies at 64 MiB, 256 GiB and 1 PiB. When a superblock mirror is updated, its ''generation number'' is incremented. At mount time, the copy with the highest generation number is used. All superblock mirrors are updated in tandem, except in
SSD A solid-state drive (SSD) is a type of solid-state storage device that uses Integrated circuit, integrated circuits to store data persistence (computer science), persistently. It is sometimes called semiconductor storage device, solid-stat ...
mode which alternates updates among mirrors to provide some
wear levelling Wear leveling (also written as wear levelling) is a technique Wear leveling techniques for flash memory systems. for prolonging the service life of some kinds of erasable computer storage media, such as flash memory, which is used in solid-state d ...
.


Commercial support


Supported

*
Oracle Linux Oracle Linux (abbreviated OL, formerly known as Oracle Enterprise Linux or OEL) is a Linux distribution packaged and freely distributed by Oracle, available partially under the GNU General Public License since late 2006. It is, in part, compile ...
from version 7 *
SUSE Linux Enterprise Server SUSE Linux Enterprise (SLE) is a Linux-based operating system developed by SUSE. It is available in two editions, suffixed with Server (SLES) for servers and mainframes, and Desktop (SLED) for workstations and desktop computers. Its major ve ...
from version 12 *
Synology Synology Inc. () is a Taiwanese corporation that specializes in network-attached storage (NAS) Computer appliance, appliances. Synology's line of NAS is known as the DiskStation for desktop models, FlashStation for Flash memory, all-flash model ...
DiskStation Manager (DSM) from version 6.0


No longer supported

* Btrfs was included as a "technology preview" in
Red Hat Enterprise Linux Red Hat Enterprise Linux (RHEL) is a commercial Linux distribution developed by Red Hat. Red Hat Enterprise Linux is released in server versions for x86-64, Power ISA, ARM64, and IBM Z and a desktop version for x86-64. Fedora Linux and ...
6 and 7; it was removed in RHEL 8 in 2018.


See also

*
APFS Apple File System (APFS) is a proprietary file system developed and deployed by Apple Inc. for macOS Sierra (10.12.4) and later, iOS 10.3, tvOS 10.2, watchOS 3.2, and all versions of iPadOS. It aims to fix core problems of HFS+ (also ca ...
– a copy-on-write file system for macOS, iPadOS, iOS, tvOS and watchOS *
Bcachefs Bcachefs is a copy-on-write (COW) file system for Linux-based operating systems. Its primary developer, Kent Overstreet, first announced it in 2015, and it was added to the Linux kernel beginning with 6.7. It is intended to compete with the moder ...
*
Comparison of file systems The following tables compare general and technical information for a number of file systems. General information Metadata All widely used file systems record a last modified time stamp (also known as "mtime"). It is not included i ...
*
HAMMER A hammer is a tool, most often a hand tool, consisting of a weighted "head" fixed to a long handle that is swung to deliver an impact to a small area of an object. This can be, for example, to drive nail (fastener), nails into wood, to sh ...
– DragonFly BSD's file system that uses B-trees, paired with checksums as a countermeasure for data corruption *
List of file systems The following lists identify, characterize, and link to more thorough information on file systems. Many older operating systems support only their one "native" file system, which does not bear any name apart from the name of the operating system i ...
*
ReFS Resilient File System (ReFS), codenamed "Protogon", is a Microsoft proprietary file system introduced with Windows Server 2012 with the intent of becoming the "next generation" file system after NTFS. ReFS was designed to overcome problem ...
– a copy-on-write file system for
Windows Server 2012 Windows Server 2012, codenamed "Windows Server 8", is the ninth major version of the Windows NT operating system produced by Microsoft to be released under the Windows Server brand name. It is the server version of Windows based on Windows ...
*
ZFS ZFS (previously Zettabyte File System) is a file system with Volume manager, volume management capabilities. It began as part of the Sun Microsystems Solaris (operating system), Solaris operating system in 2001. Large parts of Solaris, includin ...


Notes


References


External links

* * a conference presentation by Avi Miller, an Oracle engineer
Btrfs: Working with multiple devices
LWN.net LWN.net is a computing webzine with an emphasis on free software and software for Linux and other Unix-like operating systems. It consists of a weekly issue, separate stories which are published most days, and threaded discussion attached to ever ...
, December 2013, by Jonathan Corbet
Marc's Linux Btrfs posts
detailed insights into various Btrfs features
Btrfs overview
LinuxCon 2014, by Marc Merlin
WinBtrfs
Btrfs Driver For
Windows Windows is a Product lining, product line of Proprietary software, proprietary graphical user interface, graphical operating systems developed and marketed by Microsoft. It is grouped into families and subfamilies that cater to particular sec ...
{{Portal bar, Free and open-source software, Linux Compression file systems File systems supported by the Linux kernel Linux file system-related software Disk file systems