
A database shard, or simply a shard, is a horizontal partition of data in a database or search engine. Each shard may be held on a separate database server instance, to spread load. Some data in a database remains present in all shards, but some appears only in a single shard. Each shard acts as the single source for this subset of data.


Database architecture

Horizontal partitioning is a database design principle whereby ''rows'' of a database table are held separately, rather than being split into columns (which is what normalization and vertical partitioning do, to differing extents). Each partition forms part of a shard, which may in turn be located on a separate database server or physical location.

There are numerous advantages to the horizontal partitioning of data. Since tables are divided and distributed across multiple servers, the total number of rows in each table in each database is reduced. This reduces index size, which generally improves search performance. A database shard can be placed on separate hardware, and multiple shards can be placed on multiple machines. This distributes the database over a large number of machines, which can greatly improve performance. In addition, if the database shard is based on some real-world segmentation of the data (e.g., European customers vs. American customers), then it may be possible to infer the appropriate shard membership easily and automatically, and query only the relevant shard.

In practice, sharding is complex. Although it has long been done by hand-coding (especially where rows have an obvious grouping, as in the customer region example above), this is often inflexible. There is a desire to support sharding automatically, both in terms of adding code support for it, and for identifying candidates to be sharded separately. Consistent hashing is a technique used in sharding to spread large loads across multiple smaller services and servers.

Where distributed computing is used to separate load between multiple servers (either for performance or reliability reasons), a shard approach may also be useful. In the 2010s, sharding of execution capacity, as well as the more traditional sharding of data, emerged as a potential approach to overcome performance and scalability problems in blockchains.
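
The consistent hashing technique mentioned above can be sketched as follows. This is a minimal illustration, not the scheme used by any particular product: the class name `ConsistentHashRing`, the shard names, and the choice of SHA-256 with 100 virtual points per shard are all assumptions made for the example.

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    """Map a string to a stable 64-bit position on the ring."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


class ConsistentHashRing:
    """Hypothetical ring: each shard owns several virtual points on it."""

    def __init__(self, shards, vnodes=100):
        self._vnodes = vnodes
        self._points = []  # sorted ring positions
        self._owner = {}   # ring position -> shard name
        for shard in shards:
            self.add_shard(shard)

    def add_shard(self, shard: str) -> None:
        for i in range(self._vnodes):
            point = _hash(f"{shard}#{i}")
            self._owner[point] = shard
            bisect.insort(self._points, point)

    def remove_shard(self, shard: str) -> None:
        self._points = [p for p in self._points if self._owner[p] != shard]
        self._owner = {p: s for p, s in self._owner.items() if s != shard}

    def shard_for(self, key: str) -> str:
        if not self._points:
            raise LookupError("ring has no shards")
        # First point clockwise from the key's position, wrapping around.
        idx = bisect.bisect_right(self._points, _hash(key)) % len(self._points)
        return self._owner[self._points[idx]]


ring = ConsistentHashRing(["shard-a", "shard-b", "shard-c"])
keys = [f"customer-{i}" for i in range(1000)]
before = {k: ring.shard_for(k) for k in keys}

ring.add_shard("shard-d")
after = {k: ring.shard_for(k) for k in keys}
moved = [k for k in keys if before[k] != after[k]]
```

In this sketch, adding a shard only moves keys onto the new shard; keys never shuffle between the existing shards, which is the property that makes the technique attractive when the number of servers changes.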


Compared to horizontal partitioning

Horizontal partitioning splits one or more tables by row, usually within a ''single'' instance of a schema and a database server. It may offer an advantage by reducing index size (and thus search effort), provided that there is some obvious, robust, implicit way to identify in which partition a particular row will be found, without first needing to search the index. The classic example is a pair of 'CustomersEast' and 'CustomersWest' tables, where a customer's ZIP code already indicates in which table the row will be found.

Sharding goes beyond this. It partitions the problematic table(s) in the same way, but it does this across potentially ''multiple'' instances of the schema. The obvious advantage is that search load for the large partitioned table can now be split across multiple servers (logical or physical), not just multiple indexes on the same logical server.

Splitting shards across multiple isolated instances requires more than simple horizontal partitioning. The hoped-for gains in efficiency would be lost if querying the database required ''multiple'' instances to be queried just to retrieve a simple dimension table. Beyond partitioning, sharding thus splits large partitionable tables across the servers, while smaller tables are replicated as complete units.

This is also why sharding is related to a shared-nothing architecture: once sharded, each shard can live in a totally separate logical schema instance, physical database server, data center, or continent. There is no ongoing need for one shard to retain shared access to the unpartitioned tables held in other shards.

This makes replication across multiple servers easy (simple horizontal partitioning does not). It is also useful for worldwide distribution of applications, where communications links between data centers would otherwise be a bottleneck.

There is also a requirement for some notification and replication mechanism between schema instances, so that the unpartitioned tables remain as closely synchronized as the application demands. This is a complex choice in the architecture of sharded systems: approaches range from making these tables effectively read-only (updates are rare and batched), to dynamically replicated tables (at the cost of reducing some of the distribution benefits of sharding), with many options in between.
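
The split described above, with a large table partitioned across isolated instances and a small dimension table replicated in full to each, can be sketched with in-memory SQLite databases standing in for separate servers. The table names, the state-code routing rule, and the helper functions are assumptions made for this illustration; the ZIP-code example above would work the same way.

```python
import sqlite3

# Hypothetical routing rule: which state codes belong to which shard.
SHARD_REGIONS = {"east": ("NY", "MA"), "west": ("CA", "WA")}

# Small dimension table, replicated as a complete unit to every shard.
STATES = [("NY", "New York"), ("MA", "Massachusetts"),
          ("CA", "California"), ("WA", "Washington")]


def make_shard(states_dim):
    """Create one isolated schema instance (stands in for a separate server)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE states (code TEXT PRIMARY KEY, name TEXT)")
    conn.execute(
        "CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, state TEXT)")
    conn.executemany("INSERT INTO states VALUES (?, ?)", states_dim)
    return conn


shards = {name: make_shard(STATES) for name in SHARD_REGIONS}


def shard_for_state(state):
    """Infer the shard from the row itself, without searching any index."""
    for name, states in SHARD_REGIONS.items():
        if state in states:
            return name
    raise KeyError(state)


def insert_customer(cid, name, state):
    shards[shard_for_state(state)].execute(
        "INSERT INTO customers VALUES (?, ?, ?)", (cid, name, state))


def customers_in(state):
    """Query a single shard; the join to 'states' stays local to it."""
    conn = shards[shard_for_state(state)]
    return conn.execute(
        "SELECT c.name, s.name FROM customers c "
        "JOIN states s ON s.code = c.state "
        "WHERE c.state = ? ORDER BY c.id",
        (state,)).fetchall()


insert_customer(1, "Ada", "NY")
insert_customer(2, "Grace", "CA")
insert_customer(3, "Edsger", "NY")
```

Because every shard carries its own copy of `states`, the join never leaves the shard. The cost is that any change to `states` has to be propagated to every copy, which is the synchronization problem described above.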


Implementations

* Altibase provides a combined (client-side and server-side) sharding architecture that is transparent to client applications.
* Apache HBase can shard automatically.
* Azure SQL Database Elastic Database tools shard to scale the data tier of an application out and in.
* ClickHouse, a fast open-source OLAP database management system, shards.
* Couchbase shards automatically and transparently.
* CUBRID shards since version 9.0.
* Db2 Data Partitioning Feature (MPP) is a shared-nothing database whose partitions run on separate nodes.
* DRDS (Distributed Relational Database Service) of Alibaba Cloud does database/table sharding, and supports Singles' Day.
* Elasticsearch enterprise search server shards.
* eXtreme Scale is a cross-process in-memory key/value data store (a NoSQL data store). It uses sharding to achieve scalability across processes for both data and MapReduce-style parallel processing.
* Hibernate shards, but has had little development since 2007.
* IBM Informix shards since version 12.1 xC1 as part of the MACH11 technology. Informix 12.10 xC2 added full compatibility with MongoDB drivers, allowing regular relational tables to be mixed with NoSQL collections, while still allowing sharding, fail-over and ACID properties.
* Kdb+ shards since version 2.0.
* MariaDB Spider, a storage engine that supports table federation, table sharding, XA transactions, and ODBC data sources. The MariaDB Spider engine is bundled in MariaDB server since version 10.0.4.
* MonetDB, an open-source column-store, does read-only sharding in its July 2015 release.
* MongoDB shards since version 1.6.
* MySQL Cluster automatically and transparently shards across low-cost commodity nodes, allowing scale-out of read and write queries, without requiring changes to the application.
* MySQL Fabric (part of MySQL utilities) shards.
* Oracle Database shards since 12c Release 2, combining the advantages of sharding with the well-known capabilities of the enterprise-ready, multi-model Oracle Database.
* Oracle NoSQL Database has automatic sharding and elastic, online expansion of the cluster (adding more shards).
* OrientDB shards since version 1.7.
* Solr enterprise search server shards.
* ScyllaDB runs sharded on each core in a server, across all the servers in a cluster.
* Spanner, Google's global-scale distributed database, shards across multiple Paxos state machines to scale to "millions of machines across hundreds of data centers and trillions of database rows".
* SQLAlchemy ORM, a data-mapper for the Python programming language, shards.
* SQL Server, since SQL Server 2005, shards with the help of third-party tools.
* Teradata markets a massively parallel database management system as a "data warehouse".
* Vault, a cryptocurrency, shards to drastically reduce the data that users need to join the network and verify transactions. This allows the network to scale much further.
* Vitess, an open-source database clustering system, shards MySQL. It is a Cloud Native Computing Foundation project.
* ShardingSphere, a database clustering system, provides data sharding, distributed transactions, and distributed database management. It is an Apache Software Foundation (ASF) project.


Disadvantages

Sharding a database table before it has been optimized locally causes premature complexity. Sharding should be used only when all other options for optimization are inadequate. The introduced complexity of database sharding causes the following potential problems:
* ''SQL complexity'' - Increased bugs because the developers have to write more complicated SQL to handle sharding logic.
* ''Additional software'' - Software that partitions, balances, coordinates, and ensures integrity can fail.
* ''Single point of failure'' - Corruption of one shard due to network/hardware/systems problems causes failure of the entire table.
* ''Fail-over server complexity'' - Fail-over servers must have copies of the fleets of database shards.
* ''Backups complexity'' - Database backups of the individual shards must be coordinated with the backups of the other shards.
* ''Operational complexity'' - Adding/removing indexes, adding/deleting columns, and modifying the schema become much more difficult.
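
As an illustration of the ''SQL complexity'' item, the sketch below shows what a query that cannot be routed to a single shard may look like from the application side. In-memory SQLite databases stand in for separate shard servers, and the `orders` table and both helper functions are hypothetical; real sharding middleware may hide some of this work.

```python
import heapq
import sqlite3


def make_shard(rows):
    """Create one isolated shard holding part of a hypothetical orders table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, total REAL)")
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    return conn


shards = [
    make_shard([(1, "Ada", 120.0), (2, "Edsger", 40.0)]),
    make_shard([(3, "Grace", 300.0), (4, "Alan", 75.0)]),
]


def top_orders(n):
    """Fan out to every shard, then merge the partial results in the app."""
    partials = []
    for conn in shards:
        partials.extend(conn.execute(
            "SELECT id, customer, total FROM orders "
            "ORDER BY total DESC LIMIT ?", (n,)))
    # Each shard returned its own top n; the global top n is among them.
    return heapq.nlargest(n, partials, key=lambda row: row[2])


def total_revenue():
    """Aggregates have to be combined across shards by hand."""
    return sum(
        conn.execute("SELECT COALESCE(SUM(total), 0) FROM orders").fetchone()[0]
        for conn in shards)
```

On a single unsharded table each of these would be one SQL statement. Here the ordering, limiting, and summing logic is split between the database and the application, which is where the additional bugs tend to creep in.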


Etymology

In a database context, the term "shard" is most likely derived from one of two sources. The first is Computer Corporation of America's "A System for Highly Available Replicated Data" (Sarin, DeWitt & Rosenberg, ''Overview of SHARD: A System for Highly Available Replicated Data'', Technical Report CCA-88-01, Computer Corporation of America, May 1988), which utilized redundant hardware to facilitate data ''replication'' (as opposed to horizontal partitioning). The second is the critically acclaimed 1997 MMORPG video game ''Ultima Online'', which set 8 Guinness World Records and was designated by ''Time'' as one of the 100 greatest video games of all time.

Richard Garriott, creator of ''Ultima Online'', recollects the term being coined during the production phase, when the team attempted to create a self-regulating virtual ecology system in which players could leverage new internet access (a revolutionary technology at the time) to interact and harvest in-game resources. Although the virtual ecology functioned as intended during in-house testing, its natural balance failed "almost instantaneously" because players killed off all wildlife across the playable area faster than the spawning system could operate. Garriott's production team attempted to mitigate this issue by separating the global player base into separate sessions, and by rewriting part of the fictional connection of ''Ultima Online'' to the end of ''Ultima I: The First Age of Darkness'', where the defeat of its antagonist Mondain also led to the creation of multiverse "shards". This modification provided Garriott's team with the fictional basis needed to justify creating copies of the virtual environment. However, the game's sharp rise to critical acclaim also meant that the new multiverse virtual ecology system was quickly overwhelmed as well. After several months of testing, Garriott's team decided to abandon the feature altogether, and stripped the game of its functionality.

Today, the term "shard" refers to the deployment and use of redundant hardware across database systems.


See also

* Block Range Index
* Shared-nothing architecture



External links


Informix JSON data sharding