Shard (database architecture) explained

A database shard, or simply a shard, is a horizontal partition of data in a database or search engine. Each shard may be held on a separate database server instance, to spread load.

Some data within a database remains present in all shards, but some appear only in a single shard. Each shard (or server) acts as the single source for this subset of data.[1]

Database architecture

Horizontal partitioning is a database design principle whereby rows of a database table are held separately, rather than being split into columns (which is what normalization and vertical partitioning do, to differing extents). Each partition forms part of a shard, which may in turn be located on a separate database server or physical location.

There are numerous advantages to the horizontal partitioning approach. Since the tables are divided and distributed into multiple servers, the total number of rows in each table in each database is reduced. This reduces index size, which generally improves search performance. A database shard can be placed on separate hardware, and multiple shards can be placed on multiple machines. This enables a distribution of the database over a large number of machines, greatly improving performance. In addition, if the database shard is based on some real-world segmentation of the data (e.g., European customers v. American customers) then it may be possible to infer the appropriate shard membership easily and automatically, and query only the relevant shard.[2]

In practice, sharding is complex. Although it has been done for a long time by hand-coding (especially where rows have an obvious grouping, as per the example above), this is often inflexible. There is a desire to support sharding automatically, both in terms of adding code support for it, and for identifying candidates to be sharded separately. Consistent hashing is a technique used in sharding to spread large loads across multiple smaller services and servers.[3]

Where distributed computing is used to separate load between multiple servers (either for performance or reliability reasons), a shard approach may also be useful. In the 2010s, sharding of execution capacity, as well as the more traditional sharding of data, has emerged as a potential approach to overcome performance and scalability problems in blockchains.[4] [5]

Compared to horizontal partitioning

Horizontal partitioning splits one or more tables by row, usually within a single instance of a schema and a database server. It may offer an advantage by reducing index size (and thus search effort) provided that there is some obvious, robust, implicit way to identify in which partition a particular row will be found, without first needing to search the index, e.g., the classic example of the 'CustomersEast' and 'CustomersWest' tables, where their ZIP code already indicates where they will be found.

Sharding goes beyond this. It partitions the problematic table(s) in the same way, but it does this across potentially multiple instances of the schema. The obvious advantage would be that search load for the large partitioned table can now be split across multiple servers (logical or physical), not just multiple indexes on the same logical server.

Splitting shards across multiple isolated instances requires more than simple horizontal partitioning. The hoped-for gains in efficiency would be lost, if querying the database required multiple instances to be queried, just to retrieve a simple dimension table. Beyond partitioning, sharding thus splits large partitionable tables across the servers, while smaller tables are replicated as complete units.

This is also why sharding is related to a shared-nothing architecture—once sharded, each shard can live in a totally separate logical schema instance / physical database server / data center / continent. There is no ongoing need to retain shared access (from between shards) to the other unpartitioned tables in other shards.

This makes replication across multiple servers easy (simple horizontal partitioning does not). It is also useful for worldwide distribution of applications, where communications links between data centers would otherwise be a bottleneck.

There is also a requirement for some notification and replication mechanism between schema instances, so that the unpartitioned tables remain as closely synchronized as the application demands. This is a complex choice in the architecture of sharded systems: approaches range from making these effectively read-only (updates are rare and batched), to dynamically replicated tables (at the cost of reducing some of the distribution benefits of sharding) and many options in between.

Implementations

Disadvantages

Sharding a database table before it has been optimized locally causes premature complexity. Sharding should be used only when all other options for optimization are inadequate. The introduced complexity of database sharding causes the following potential problems:

Etymology

In a database context, most recognize the term "shard" is most likely derived from either one of two sources: Computer Corporation of America's "A System for Highly Available Replicated Data",[28] which utilized redundant hardware to facilitate data replication (as opposed to horizontal partitioning); or the critically acclaimed 1997 MMORPG video game Ultima Online which set 8 Guinness World Records and was designated by Time as one of the 100 greatest video games produced of all time.[29]

Richard Garriott, creator of Ultima Online, recollects the term being coined during production phase when they attempted to create a self-regulating virtual ecology system, whereby players may leverage new internet access (a revolutionary technology at the time) to interact and harvest in-game resources.[30] Although the virtual ecology functioned as intended during in-house testing, its natural balance failed "almost instantaneously" due to players killing off every living wildlife across the playable area faster than the spawning system could operate. Garriott's production team attempted to mitigate this issue by separating the global player base into separate sessions, and rewriting part of Ultima Online fictional connection to the end of , where the defeat of its antagonist Mondain also led to the creation of multiverse "shards". This modification provided Garriott's team with the fictional basis needed to justify creating copies of the virtual environment. However, the game's sharp rise to critical acclaim also meant that the new multiverse virtual ecology system was quickly overwhelmed as well. After several months of testing, Garriott's team decided to abandon the feature altogether, and stripped the game of its functionality.

Today, the term "shard" refers to the deployment and use of redundant hardware across database systems.

See also

External links

Notes and References

  1. Book: NoSQL Distilled . Pramod J. . Sadalage . Martin Fowler (software engineer) . Martin . Fowler . 4: Distribution Models . 978-0321826626 . 2012 . Pearson Education .
  2. Web site: Shard - A Database Design . Rahul Roy . July 28, 2008.
  3. Web site: Ries. Eric. Sharding for Startups.
  4. Book: https://dl.acm.org/doi/abs/10.1145/3318041.3355457 . 21 October 2019 . 41–61 . Wang. Gang . Shi. Zhijie Jerry . Nixon. Mark . Han. Song . Proceedings of the 1st ACM Conference on Advances in Financial Technologies . SoK . 10.1145/3318041.3355457 . 9781450367325 . 204749727 .
  5. https://dl.acm.org/doi/abs/10.1145/3318041.3355457 . 18 July 2020 . FC 2020: Financial Cryptography and Data Security . 114–134 . Yu. Mingchao . Sahraei. Saeid . Nixon. Mark . Han. Song . Proceedings of the 1st ACM Conference on Advances in Financial Technologies . SoK: Sharding on Blockchain . 10.1145/3318041.3355457 . 9781450367325 . 204749727 .
  6. Web site: Apache HBase – Apache HBase™ Home. hbase.apache.org.
  7. Web site: Introducing Elastic Scale preview for Azure SQL Database. azure.microsoft.com. 2 October 2014 .
  8. Web site: Alibaba Cloud Help Center - Cloud Definition and Explanation of Cloud Based Services - Alibaba Cloud. www.alibabacloud.com.
  9. Web site: Focuses on Large-Scale Online Databases - Alibaba Cloud. www.alibabacloud.com.
  10. Web site: Index Shard Allocation | Elasticsearch Guide [7.13] | Elastic]. www.elastic.co.
  11. Web site: IBM Docs.
  12. Web site: Hibernate Shards . 2007-02-08.
  13. Web site: Hibernate Shards. 2011-03-30. 2008-12-16. https://web.archive.org/web/20081216005922/http://www.hibernate.org/hib_docs/shards/reference/en/html/. dead.
  14. Web site: New Grid queries for Informix .
  15. Web site: NoSQL support in Informix (JSON storage, Mongo DB API). September 24, 2013.
  16. Web site: Spider . 2022-12-20 . MariaDB KnowledgeBase.
  17. Web site: MonetDB July2015 Released . 31 August 2015.
  18. Web site: MySQL Cluster Features & Benefits . 2012-11-23.
  19. Web site: MySQL Fabric sharding quick start guide.
  20. Web site: Oracle Sharding . Oracle . 2018-05-24 . 2021-07-10.
  21. Web site: DistributedSearch - SOLR - Apache Software Foundation. cwiki.apache.org.
  22. Web site: James C . Corbett . Jeffrey . Dean . Michael . Epstein . Andrew . Fikes . Christopher . Frost . JJ . Furman . Sanjay . Ghemawat . Andrey . Gubarev . Christopher . Heiser . Peter . Hochschild . Wilson . Hsieh . Sebastian . Kanthak . Eugene . Kogan . Hongyi . Li . Alexander . Lloyd . Sergey . Melnik . David . Mwaura . David . Nagle . Sean . Quinlan . Rajesh . Rao . Lindsay . Rolig . Yasushi . Saito . Michal . Szymaniak . Christopher . Taylor . Ruth . Wang . Dale . Woodford . Spanner: Google's Globally-Distributed Database . Proceedings of OSDI 2012 . 24 February 2014.
  23. Web site: sqlalchemy/sqlalchemy. July 9, 2021. GitHub.
  24. Web site: Partitioning and Sharding Options for SQL Server and SQL Azure. infoq.com.
  25. Web site: A faster, more efficient cryptocurrency. MIT News. 24 January 2019 . 2019-01-30.
  26. Web site: Vitess. vitess.io.
  27. Web site: ShardingSphere. shardingsphere.apache.org.
  28. Sarin, DeWitt & Rosenberg, Overview of SHARD: A System for Highly Available Replicated Data, Technical Report CCA-88-01, Computer Corporation of America, May 1988
  29. Web site: Database "sharding" came from UO?. Koster. Raph. 2009-01-08. Raph Koster's Website. 2015-01-17.
  30. Web site: Ultima Online: The Virtual Ecology | War Stories. Ars Technica Videos. 21 December 2017 .