In
computer science
Computer science is the study of computation, automation, and information. Computer science spans theoretical disciplines (such as algorithms, theory of computation, information theory, and automation) to practical disciplines (includin ...
, the log-structured merge-tree (also known as LSM tree, or LSMT) is a
data structure
In computer science, a data structure is a data organization, management, and storage format that is usually chosen for Efficiency, efficient Data access, access to data. More precisely, a data structure is a collection of data values, the rel ...
with performance characteristics that make it attractive for providing
indexed access to files with high insert volume, such as
transactional log data. LSM trees, like other
search tree
In computer science, a search tree is a tree data structure used for locating specific keys from within a set. In order for a tree to function as a search tree, the key for each node must be greater than any keys in subtrees on the left, and les ...
s, maintain key-value pairs. LSM trees maintain data in two or more separate structures, each of which is optimized for its respective underlying storage medium; data is synchronized between the two structures efficiently, in batches.
One simple version of the LSM tree is a two-level LSM tree. As described by
Patrick O'Neil
Patrick Eugene O'Neil (1942 – September 20, 2019) was an American computer scientist, an expert on databases, and a professor of computer science at the University of Massachusetts Boston. , a two-level LSM tree comprises two
tree-like structures, called C
0 and C
1. C
0 is smaller and entirely resident in memory, whereas C
1 is resident on disk. New records are inserted into the memory-resident C
0 component. If the insertion causes the C
0 component to exceed a certain size threshold, a contiguous segment of entries is removed from C
0 and merged into C
1 on disk. The performance characteristics of LSM trees stem from the fact that each component is tuned to the characteristics of its underlying storage medium, and that data is efficiently migrated across media in rolling batches, using an algorithm reminiscent of
merge sort
In computer science, merge sort (also commonly spelled as mergesort) is an efficient, general-purpose, and comparison-based sorting algorithm. Most implementations produce a stable sort, which means that the order of equal elements is the same ...
.

Most LSM trees used in practice employ multiple levels. Level 0 is kept in main memory, and might be represented using a tree. The on-disk data is organized into sorted ''runs'' of data. Each run contains data sorted by the index key. A run can be represented on disk as a single file, or alternatively as a collection of files with non-overlapping key ranges. To perform a query on a particular key to get its associated value, one must search in the Level 0 tree and also each run.
The Stepped-Merge version of the LSM tree is a variant of the LSM tree that supports multiple levels with multiple tree structures at each level.
A particular key may appear in several runs, and what that means for a query depends on the application. Some applications simply want the newest key-value pair with a given key. Some applications must combine the values in some way to get the proper aggregate value to return. For example, in
Apache Cassandra
Cassandra is a free and open-source, distributed, wide-column store, NoSQL database management system designed to handle large amounts of data across many commodity servers, providing high availability with no single point of failure. Cass ...
, each value represents a row in a database, and different versions of the row may have different sets of columns.
In order to keep down the cost of queries, the system must avoid a situation where there are too many runs.
Extensions to the 'leveled' method to incorporate
B+ tree
A B+ tree is an m-ary tree with a variable but often large number of children per node. A B+ tree consists of a root, internal nodes and leaves. The root may be either a leaf or a node with two or more children.
A B+ tree can be viewed as a B ...
structures have been suggested, for example bLSM and Diff-Index. LSM-tree was originally designed for write-intensive workloads. As increasingly more read and write workloads co-exist under an LSM-tree storage structure, read data accesses can experience high latency and low throughput due to frequent invalidations of cached data in buffer caches by LSM-tree compaction operations. To re-enable effective buffer caching for fast data accesses, a Log-Structured buffered-Merged tree (LSbM-tree) is proposed and implemented.
LSM trees are used in data stores such a
Apache AsterixDB Bigtable
Bigtable is a fully managed wide-column and key-value NoSQL database service for large analytical and operational workloads as part of the Google Cloud portfolio.
History
Bigtable development began in 2004.. It is now used by a number of Googl ...
,
HBase
HBase is an open-source non-relational distributed database modeled after Google's Bigtable and written in Java. It is developed as part of Apache Software Foundation's Apache Hadoop project and runs on top of HDFS (Hadoop Distributed Fil ...
,
LevelDB,
Apache Accumulo
Apache Accumulo is a highly scalable sorted, distributed key-value store based on Google's Bigtable. It is a system built on top of Apache Hadoop, Apache ZooKeeper, and Apache Thrift. Written in Java, Accumulo has cell-level access labels a ...
,
SQLite4
SQLite (, ) is a database engine written in the C programming language. It is not a standalone app; rather, it is a library that software developers embed in their apps. As such, it belongs to the family of embedded databases. It is the mos ...
,
Tarantool
Tarantool is an in-memory computing platform with a flexible data schema, best used for creating high-performance applications. Two main parts of it are an in-memory database and a Lua application server.
Tarantool maintains data in memory and e ...
,
RocksDB
RocksDB is a high performance embedded database for key-value data. It is a fork of Google's LevelDB optimized to exploit many CPU cores, and make efficient use of fast storage, such as solid-state drives (SSD), for input/output (I/O) bound ...
,
WiredTiger,
Apache Cassandra
Cassandra is a free and open-source, distributed, wide-column store, NoSQL database management system designed to handle large amounts of data across many commodity servers, providing high availability with no single point of failure. Cass ...
,
InfluxDB
InfluxDB is an open-source time series database (TSDB) developed by the company InfluxData. It is written in the Go programming language for storage and retrieval of time series data in fields such as operations monitoring, application metr ...
and
ScyllaDB.
References
;General
*
*
*
External links
An Overview of Log Structured Merge Trees
{{CS trees
Trees (data structures)
Database index techniques