Table Sharding algorithms in Cloud computing and their optimization

Summary

Database Sharding is an enabling technology for achieving high levels of scalability in transactional database applications. Google is credited with coining the term "sharding", although Database Sharding has also been successfully implemented by many other high-profile online service providers such as Amazon and eBay.

Many Database Sharding implementations take advantage of open-source databases deployed on commodity multi-core server hardware, which results in the most cost-effective method of achieving database application scalability.

There are many techniques and approaches for accomplishing Database Sharding, as well as a number of factors that a successful implementation must consider.

This paper explores the various alternatives and issues, providing a foundational understanding of the important concepts vital for architects, developers and system administrators considering a Database Sharding implementation.

Keywords

1. NoSQL

A NoSQL or Not Only SQL database provides a mechanism for storage and retrieval of data that is modeled by means other than the tabular relations used in relational databases. Motivations for this approach include simplicity of design, horizontal scaling and finer control over availability.

NoSQL databases are increasingly used in big data and real-time web applications. NoSQL systems are also called "Not only SQL" to emphasize that they may also support SQL-like query languages. Many NoSQL stores compromise consistency (in the sense of the CAP theorem) in favor of availability and partition tolerance. Barriers to the greater adoption of NoSQL stores include the use of low-level query languages, the lack of standardized interfaces, and large existing investments in SQL. Most NoSQL stores lack true ACID transactions, although a few recent systems, such as FairCom c-treeACE, Google Spanner and FoundationDB, have made them central to their designs.

2. Map Reduce

Map Reduce is a programming model and an associated implementation for processing and generating large data sets with a parallel, distributed algorithm on a cluster.

A Map Reduce program is composed of a Map() procedure that performs filtering and sorting (such as sorting students by first name into queues, one queue for each name) and a Reduce() procedure that performs a summary operation (such as counting the number of students in each queue, yielding name frequencies). The "Map Reduce System" (also called "infrastructure" or "framework") orchestrates the processing by marshalling the distributed servers, running the various tasks in parallel, managing all communications and data transfers between the various parts of the system, and providing for redundancy and fault tolerance.

The model is inspired by the map and reduce functions commonly used in functional programming, although their purpose in the Map Reduce framework is not the same as in their original forms. The key contributions of the Map Reduce framework are not the actual map and reduce functions, but the scalability and fault-tolerance achieved for a variety of applications by optimizing the execution engine once. As such, a single-threaded implementation of Map Reduce will usually not be faster than a traditional implementation. Only when the optimized distributed shuffle operation (which reduces network communication cost) and fault tolerance features of the Map Reduce framework come into play, is the use of this model beneficial.
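To make the two roles concrete, here is a minimal single-process sketch of the Map()/Reduce() contract using the classic word-count example; the function names and sample documents are purely illustrative and not tied to any particular framework such as Hadoop.

```python
from collections import defaultdict

def map_phase(document):
    # Map(): emit a (key, value) pair for every word in the input record.
    for word in document.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    # The framework groups all values emitted under the same key.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    # Reduce(): summarize all values for one key (here, a simple count).
    return key, sum(values)

documents = ["the quick brown fox", "the lazy dog", "the fox"]
pairs = (pair for doc in documents for pair in map_phase(doc))
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(counts)  # {'the': 3, 'quick': 1, 'brown': 1, 'fox': 2, 'lazy': 1, 'dog': 1}
```

In a distributed framework the shuffle step is what moves data across the network, which is why its optimization dominates the overall cost, as noted above.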

Map Reduce libraries have been written in many programming languages, with different levels of optimization. A popular open-source implementation is Apache Hadoop. The name Map Reduce originally referred to the proprietary Google technology but has since been genericized.

3. Fan-out Query

Federations provide a model for partitioning parts of your schema across multiple member databases in order to harness the scalability of many nodes. However, applications still need to query all of the data across federation members. Fan-out is a technique for querying data in your federation across many federation members. Fan-out queries are much like map/reduce in that they are formed of two parts:

Member query is the piece that is sent to all members involved in the query, and

Summary query is the post-processing piece that condenses the results from the member queries into the desired final result set.

With fan-out queries the member query is always present, but the summary query may not be needed. For example, if you are simply doing DML (such as data pruning or reference data management) or DDL (such as schema deployment), fan-out has only a member query and no summary query is needed. It is only when you need post-processing that you need the summary query.

Finally, fan-out queries are not exotic animals. The member query generates a result set that is fed into the summary query for post-processing. This kind of staged processing of results is something that many SQL developers already do in their stored procedure logic. This kind of results pipeline is similar to cases where you choose to use temp tables or table variables for staging results, or where you use CTEs (common table expressions) or views to stage your processing or abstract processing logic into multiple stages. The shape of fan-out queries is a deliberate breakdown of the query processing based on your federation key, but it has great similarity to other techniques built into T-SQL.
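As a rough illustration of this two-part shape, the sketch below simulates a fan-out aggregation: the member query computes a partial SUM ... GROUP BY on each shard, and the summary query merges the partial results. The shard contents and function names are hypothetical; in a real federation each member query would run over its own database connection.

```python
# Each shard holds (category, sale_amount) rows; plain lists stand in for databases.
shards = [
    [("books", 120.0), ("music", 30.0)],   # shard 0
    [("books", 80.0), ("video", 55.0)],    # shard 1
    [("music", 10.0), ("books", 40.0)],    # shard 2
]

def member_query(shard_rows):
    # Runs on every member: partial SUM(sale_amount) GROUP BY category.
    partial = {}
    for category, amount in shard_rows:
        partial[category] = partial.get(category, 0.0) + amount
    return partial

def summary_query(partials):
    # Post-processing step: condense the partial aggregates into the final result set.
    total = {}
    for partial in partials:
        for category, amount in partial.items():
            total[category] = total.get(category, 0.0) + amount
    return total

print(summary_query(member_query(s) for s in shards))
# {'books': 240.0, 'music': 40.0, 'video': 55.0}
```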

4. Replication

Replication in computing involves sharing information so as to ensure consistency between redundant resources, such as software or hardware components, to improve reliability, fault-tolerance, or accessibility.

One speaks of:

  • Data replication, if the same data is stored on multiple storage devices;
  • Computation replication, if the same computing task is executed many times.

A computational task is typically replicated in space, i.e. executed on separate devices, or it could be replicated in time, if it is executed repeatedly on a single device. Replication in space or in time is often linked to scheduling algorithms.

Access to a replicated entity is typically uniform with access to a single, non-replicated entity. The replication itself should be transparent to an external user. Also, in a failure scenario, a failover of replicas should be hidden as much as possible. The latter refers to data replication with respect to Quality of Service (QoS) aspects.

Computer scientists talk about active and passive replication in systems that replicate data or services:

  • Active replication is performed by processing the same request at every replica.
  • Passive replication involves processing each single request on a single replica and then transferring its resultant state to the other replicas.

If at any time one master replica is designated to process all the requests, then we are talking about the primary-backup scheme (master-slave scheme) predominant in high-availability clusters. On the other hand, if any replica can process a request and then distribute a new state, then this is a multi-primary scheme (called multi-master in the database field). In the multi-primary scheme, some form of distributed concurrency control must be used, such as a distributed lock manager.
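As a minimal sketch of the passive (primary-backup) scheme described above: the primary alone processes each request and then ships its resulting state to the backups. The class and method names are illustrative and not drawn from any real replication system.

```python
class Replica:
    def __init__(self):
        self.state = {}

    def apply_state(self, new_state):
        # Backups simply install the state computed by the primary.
        self.state = dict(new_state)

class Primary(Replica):
    def __init__(self, backups):
        super().__init__()
        self.backups = backups

    def handle_request(self, key, value):
        # The primary alone executes the request...
        self.state[key] = value
        # ...then propagates the resulting state to the passive replicas.
        for backup in self.backups:
            backup.apply_state(self.state)

backups = [Replica(), Replica()]
primary = Primary(backups)
primary.handle_request("balance:42", 100)
assert all(b.state == primary.state for b in backups)
```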

Load balancing differs from task replication, since it distributes a load of different (not the same) computations across machines, and allows a single computation to be dropped in case of failure. Load balancing, however, sometimes uses data replication (especially multi-master replication) internally, to distribute its data among machines.

Backup differs from replication in that it saves a copy of data unchanged for a long period of time. Replicas, on the other hand, undergo frequent updates and quickly lose any historical state. Replication is one of the oldest and most important topics in the overall area of distributed systems.

Whether one replicates data or computation, the objective is to have some group of processes that handle incoming events. If we replicate data, these processes are passive and operate only to maintain the stored data, reply to read requests, and apply updates. When we replicate computation, the usual goal is to provide fault-tolerance. For example, a replicated service might be used to control a telephone switch, with the objective of ensuring that even if the primary controller fails, the backup can take over its functions. But the underlying needs are the same in both cases: by ensuring that the replicas see the same events in equivalent orders, they stay in consistent states and hence any replica can respond to queries.

5. Consistent hashing

Consistent hashing is a special kind of hashing such that when a hash table is resized and consistent hashing is used, only n/m keys need to be remapped on average, where n is the number of keys and m is the number of slots. In contrast, in most traditional hash tables, a change in the number of array slots causes nearly all keys to be remapped.

Consistent hashing achieves the same goals as Rendezvous hashing (also called HRW Hashing). The two techniques use different algorithms, and were devised independently and contemporaneously.

Need for consistent hashing

While running collections of caching machines some limitations are experienced. A common way of load balancing n cache machines is to put object o in cache machine number hash(o) mod n. But this will not work if a cache machine is added or removed, because n changes and every object is hashed to a new location. This can be disastrous, since the originating content servers are flooded with requests from the cache machines. Hence consistent hashing is needed to avoid swamping of servers.

Consistent hashing maps objects to the same cache machine as far as possible. This means that when a cache machine is added, it takes its share of objects from all the other cache machines, and when it is removed, its objects are shared among the remaining machines.

The main idea behind the consistent hashing algorithm is to associate each cache with one or more hash value intervals where the interval boundaries are determined by calculating the hash of each cache identifier. (The hash function used to define the intervals does not have to be the same function used to hash the cached values. Only the range of the two functions need match.) If the cache is removed its interval is taken over by a cache with an adjacent interval. All the remaining caches are unchanged.

Technique

Consistent hashing is based on mapping each object to a point on the edge of a circle (or equivalently, mapping each object to a real angle). The system maps each available machine (or other storage bucket) to many pseudo-randomly distributed points on the edge of the same circle.

To find where an object should be placed, the system finds the location of that object's key on the edge of the circle; then walks around the circle until falling into the first bucket it encounters (or equivalently, the first available bucket with a higher angle). The result is that each bucket contains all the resources located between its point and the previous bucket point.

If a bucket becomes unavailable (for example because the computer it resides on is not reachable), then the angles it maps to will be removed. Requests for resources that would have mapped to each of those points now map to the next highest point. Since each bucket is associated with many pseudo-randomly distributed points, the resources that were held by that bucket will now map to many different buckets. The items that mapped to the lost bucket must be redistributed among the remaining ones, but values mapping to other buckets will still do so and do not need to be moved.

A similar process occurs when a bucket is added. By adding a bucket point, we make any resources between that and the next smaller angle map to the new bucket. These resources will no longer be associated with the previous bucket, and any value previously stored there will not be found by the selection method described above.

The portion of the keys associated with each bucket can be altered by altering the number of angles that bucket maps to.
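A minimal Python sketch of this technique follows, assuming MD5 as the hash function and 100 pseudo-random points per bucket; both choices, and the bucket names, are illustrative only.

```python
import bisect
import hashlib

class ConsistentHashRing:
    def __init__(self, buckets, points_per_bucket=100):
        self._ring = []                      # sorted list of (hash, bucket) points
        for bucket in buckets:
            self.add_bucket(bucket, points_per_bucket)

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def add_bucket(self, bucket, points_per_bucket=100):
        # Map the bucket to many pseudo-randomly distributed points on the circle.
        for i in range(points_per_bucket):
            self._ring.append((self._hash(f"{bucket}#{i}"), bucket))
        self._ring.sort()

    def remove_bucket(self, bucket):
        # Only keys that fell into this bucket's intervals will move elsewhere.
        self._ring = [(h, b) for h, b in self._ring if b != bucket]

    def get_bucket(self, key):
        # Walk clockwise from the key's position to the first bucket point.
        position = self._hash(key)
        index = bisect.bisect(self._ring, (position,)) % len(self._ring)
        return self._ring[index][1]

ring = ConsistentHashRing(["cache-a", "cache-b", "cache-c"])
print(ring.get_bucket("user:1001"))   # one of the three caches
ring.remove_bucket("cache-b")         # only cache-b's keys are redistributed
print(ring.get_bucket("user:1001"))
```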

6. Create, read, update and delete (CRUD)

In computer programming, create, read, update and delete (the acronym CRUD, possibly a backronym, sometimes extended to SCRUD with an "S" for Search) are the four basic functions of persistent storage.[1] Sometimes CRUD is expanded with the words retrieve instead of read, modify instead of update, or destroy instead of delete. It is also sometimes used to describe user interface conventions that facilitate viewing, searching, and changing information, often using computer-based forms and reports. The term was likely first popularized by James Martin in his 1983 book Managing the Data-base Environment.[2][3] The acronym may be extended to CRUDL to cover listing of large data sets, which brings additional complexity such as pagination when the data sets are too large to hold easily in memory.

Another variation of CRUD is BREAD, an acronym for "Browse, Read, Edit, Add, Delete".

What is Cloud computing?

Cloud computing involves distributed computing over a network, where a program or application may run on many connected computers at the same time. It specifically refers to a computing hardware machine or group of computing hardware machines, commonly referred to as a server, connected through a communication network such as the Internet, an intranet, a local area network (LAN) or wide area network (WAN). Any individual user who has permission to access the server can use the server's processing power to run an application, store data, or perform any other computing task. Therefore, instead of using a personal computer every time to run the application, the individual can now run the application from anywhere in the world, as the server provides the processing power to the application and the server is also connected to a network via the Internet or other connection platforms to be accessed from anywhere [33]. All this has become possible due to increasing computer processing power available to humankind at decreasing cost, as stated in Moore's law.

In common usage the term "the cloud" is essentially a metaphor for the Internet. Marketers have further popularized the phrase "in the cloud" to refer to software, platforms and infrastructure that are sold "as a service", i.e. remotely through the Internet. Typically, the seller has actual energy-consuming servers which host products and services from a remote location, so end-users don't have to; they can simply log on to the network without installing anything. The major models of cloud computing service are known as software as a service, platform as a service, and infrastructure as a service. These cloud services may be offered in a public, private or hybrid network. Google, Amazon, IBM, Oracle Cloud, Rackspace, Salesforce, Zoho and Microsoft Azure are some well-known cloud vendors.

Network-based services, which appear to be provided by real server hardware but are in fact served up by virtual hardware simulated by software running on one or more real machines, are often called cloud computing. Such virtual servers do not physically exist and can therefore be moved around and scaled up or down on the fly without affecting the end user, somewhat like a cloud becoming larger or smaller without being a physical object.

Sharding

Horizontal partitioning is a database design principle whereby?rows?of a database table are held separately, rather than being split into columns (which is what normalization and vertical partitioning do, to differing extents). Each partition forms part of a?shard, which may in turn be located on a separate database server or physical location.

Sharding is the process of splitting up your data so it resides in different tables or often different physical databases. Sharding is helpful when you have some specific set of data that outgrows either storage or reasonable performance within a single database.

There are numerous advantages to this partitioning approach. Since the tables are divided and distributed into multiple servers, the total number of rows in each table in each database is reduced. This reduces index size, which generally improves search performance. A database shard can be placed on separate hardware, and multiple shards can be placed on multiple machines. This enables a distribution of the database over a large number of machines, which means that the database load can be spread out over multiple machines, greatly improving performance. In addition, if the database shard is based on some real-world segmentation of the data (e.g., European customers vs. American customers) then it may be possible to infer the appropriate shard membership easily and automatically, and query only the relevant shard.

In practice, Sharding is far more complex. Although it has been done for a long time by hand-coding (especially where rows have an obvious grouping, as per the example above), this is often inflexible. There is a desire to support Sharding automatically, both in terms of adding code support for it, and for identifying candidates to be sharded separately. Consistent hashing is one form of automatic Sharding to spread large loads across multiple smaller services and servers.

Where distributed computing is used to separate load between multiple servers (either for performance or reliability reasons), a shard approach may also be useful.

The Rise of Database Sharding

The concept of Database Sharding has been gaining popularity over the past several years, due to the enormous growth in transaction volume and size of business application databases. This is particularly true for many successful online service providers, Software as a Service (SaaS) companies, and social networking Web sites.

Database Sharding can be simply defined as a “shared-nothing” partitioning scheme for large databases across a number of servers, making new levels of database performance and scalability achievable. If you think of broken glass, you can get the concept of sharding – breaking your database down into smaller chunks called “shards” and spreading those across a number of distributed servers.


The term “sharding” was coined by Google engineers, and popularized through their publication of the Bigtable architecture. However, the concept of “shared-nothing” database partitioning has been around for a decade or more, and there have been many implementations over this period, especially high-profile in-house solutions built by Internet leaders such as eBay, Amazon, Digg, Flickr, Skype, YouTube, Facebook, Friendster, and Wikipedia.

The focus of this paper is the importance of table sharding, available solutions and some key considerations, the options available for database partitioning, and the key considerations for a successful sharding implementation.


What Drives the Need for Database Sharding?

Database Sharding is a highly scalable approach for improving the throughput and overall performance of high-transaction, large database-centric business applications. Since the inception of the relational database, application engineers and architects have required ever-increasing performance and capacity, based on the simple observation that business databases generally grow in size over time. Adding to this general trend is the extreme expansion of business data due to the evolution of the Internet economy, the Information Age, and the prevalence of high-volume electronic commerce.

As any experienced database administrator or application developer knows all too well, it is axiomatic that as the size and transaction volume of the database tier grow linearly, response times tend to grow logarithmically. This is shown in the following diagram:


Figure 1. The growth in database transactions and volumes has a large impact on response times.

The reasons for the performance and scalability challenges are inherent to the fundamental design of the database management systems themselves. Databases rely heavily on the primary three components of any computer:

  • CPU
  • Memory
  • Disk

Through benchmark tests that we have performed, we know that each of these elements on a single server can only scale to a given point, and then other measures must be taken. While it is clear that disk I/O is the primary bottleneck, as database management systems have improved they also continue to take greater advantage of CPU and memory. In fact, we have observed that it is the matching of these three factors that determines maximum performance. In other words, you cannot add an unlimited number of CPUs (or processing cores) and see a commensurate increase in performance without also improving the memory capacity and the performance of the disk drive subsystem. It is also common to see diminishing returns as resources are added to a single database server. These factors are especially true in mixed-use business transaction systems: systems that perform a high volume of read and write transactions, as well as supporting generalized business reporting tasks.

Therefore, as business applications gain sophistication and continue to grow in demand, architects, developers and database administrators have been presented with a constant challenge of maintaining database performance for mission critical systems. This landscape drives the need for Database Sharding.

Database Partitioning Options

It has long been known that database partitioning is the answer to improving the performance and scalability of relational databases. Many techniques have evolved, including:

  • Master/Slave: This is the simplest option used by many organizations, with a single Master server for all write operations (Create, Update or Delete), and one or many additional Slave servers that provide read-only operations. The Master uses standard, near-real-time database replication to each of the Slave servers. The Master/Slave model can speed overall performance to a point, allowing read-intensive processing to be offloaded to the Slave servers, but there are several limitations with this approach:
  • The single Master server for writes is a clear limit to scalability, and can quickly create a bottleneck.
  • The Master/Slave replication mechanism is “near-real-time,” meaning that the Slave servers are not guaranteed to have a current picture of the data that is in the Master. While this is fine for some applications, if your applications require an up-to-date view, this approach is unacceptable.
  • Many organizations use the Master/Slave approach for high-availability as well, but it suffers from this same limitation given that the Slave servers are not necessarily current with the Master. If a catastrophic failure of the Master server occurs, any transactions that are pending for replication will be lost, a situation that is highly unacceptable for most business transaction applications.
  • Cluster Computing: Cluster computing utilizes many servers operating in a group, with shared messaging between the nodes of the cluster. Most often this scenario relies on a centralized shared disk facility, typically a Storage Area Network (SAN). Each node in the cluster runs a single instance of the database server, operating in various modes:
  • For high-availability, many nodes in the cluster can be used for reads, but only one for write (CRUD) operations. This can make reads faster, but write transactions do not see any benefit. If a failure of one node occurs, then another node in the cluster takes over, again continuing to operate against the shared disk facility. This approach has limited scalability due to the single bottleneck for CRUD operations. Even the reads will ultimately hit a performance limit, as the centralized shared disk facility can only spread the load so much before diminishing returns are experienced. The read limitations are particularly evident when an application requires complex joins or contains non-optimized SQL statements.
  • More advanced clustering techniques rely on real-time memory replication between nodes, keeping the memory image of nodes in the cluster up to date via a real-time messaging system. This allows each node to operate in either read or write mode, but is ultimately limited by the amount of traffic that can be transmitted between nodes (using a typical network or other high-speed communication mechanism). Therefore, as nodes are added, the communication and memory replication overhead increases geometrically, thus hitting severe scalability limits, often with a relatively small number of nodes. This solution also suffers from the same shared-disk limitations of a traditional cluster, given that a growing, single large database has increasingly intensive disk I/O.
  • Table Partitioning: Many database management systems support table partitioning, where data in a single large table can be split across multiple disks for improved disk I/O utilization. The partitioning is typically done horizontally (separating rows by range across disk partitions), but can be vertical in some systems as well (placing different columns on separate partitions). This approach can help reduce the disk I/O bottleneck for a given table, but can often make joins and other operations slower. Further, since the approach relies on a single server instance of the database management system, all other CPU and memory contention limitations apply, further limiting scalability.
  • Federated Tables: An offshoot of Table Partitioning is the Federated Table approach, where tables can be accessed across multiple servers. This approach is highly complex to administer, and lacks efficiency as the federated tables must be accessed over the network. It may work for some reporting or analytical tasks, but for general read/write transactions it is not a very likely choice.

The common drawback of each of these approaches is the reliance on shared facilities and resources. Whether relying on shared memory, centralized disk, or processor capacity, they each suffer from scalability limitations, not to mention many other drawbacks, including complex administration, lack of support for critical business requirements, and high-availability limitations.


Database Sharding, the “Shared-Nothing” Approach

Database Sharding provides a method for scalability across independent servers, each with its own CPU, memory and disk. Contrasted with other traditional methods of achieving greater database performance, it does not suffer from many of the typical limitations posed by these other approaches. The concept of a “shared-nothing” database implementation has been under research or discussion for 15 years, but it appears that the business application market is just now finding the more general need for such capability due to the exponential increase in data volumes over the past several years.

The basic concept of Database Sharding is very straightforward: take a large database, and break it into a number of smaller databases across servers. The concept is illustrated in the following diagram:


Figure 2. Database Sharding takes large databases and breaks them down into smaller databases.

The obvious advantage of the shared-nothing Database Sharding approach is improved scalability, growing in a near-linear fashion as more servers are added to the network. However, there are several other advantages of smaller databases, which should not be overlooked when considering a sharding solution:

  • Smaller databases are easier to manage. Production databases must be fully managed for regular backups, database optimization and other common tasks. With a single large database these routine tasks can be very difficult to accomplish, if only in terms of the time window required for completion. Routine table and index optimizations can stretch to hours or days, in some cases making regular maintenance infeasible. By using the sharding approach, each individual “shard” can be maintained independently, providing a far more manageable scenario, performing such maintenance tasks in parallel.
  • Smaller databases are faster. The scalability of sharding is apparent, achieved through the distribution of processing across multiple shards and servers in the network. What is less apparent is the fact that each individual shard database will outperform a single large database due to its smaller size. By hosting each shard database on its own server, the ratio between memory and data on disk is greatly improved, thereby reducing disk I/O. This results in less contention for resources, greater join performance, faster index searches, and fewer database locks. Therefore, not only can a sharded system scale to new levels of capacity, individual transaction performance benefits as well.
  • Database Sharding can reduce costs. Most Database Sharding implementations take advantage of lower-cost open source databases, or can even take advantage of “workgroup” versions of commercial databases. Additionally, sharding works well with commodity multi-core server hardware, far less expensive than high-end multi-CPU servers and expensive SANs. The overall reduction in cost due to savings in license fees, software maintenance and hardware investment is substantial, in some cases 70% or more when compared to other solutions.

There is no doubt that Database Sharding is a viable solution for many organizations, supported by the number of large online vendors and SaaS organizations that have implemented the technology (giants such as Amazon, eBay, and of course Google).

Practicalities of Database Sharding

If Database Sharding is highly scalable, less costly, and improves performance, why hasn’t adoption of the technology been more widespread? Is it feasible for your organization?

The reality is that Database Sharding is a very useful technology, but like other approaches, there are many factors to consider that ensure a successful implementation. Further, there are some limitations, and Database Sharding will not work well for every type of business application. This chapter discusses these critical considerations and how they can be addressed.

Database Sharding Challenges

Due to the distributed nature of individual databases, a number of key elements must be taken into account:

  • Reliability. First and foremost, any production business application must be reliable and fault-tolerant, and cannot be subject to frequent outages. The database tier is often the single most critical element in any reliability design, and therefore an implementation of Database Sharding is no exception. In fact, due to the distributed nature of multiple shard databases, the criticality of a well-designed approach is even greater. To ensure a fault-tolerant and reliable approach, the following items are required:
  • Automated backups of individual Database Shards.
  • Database Shard redundancy, ensuring at least 2 “live” copies of each shard are available in the event of an outage or server failure. This requires a high-performance, efficient, and reliable replication mechanism.
  • Cost-effective hardware?redundancy, both within and across servers.
  • Automated failover when an outage or server failure occurs.
  • Disaster Recovery site management.
  • Distributed queries. Many types of queries can be processed far faster using distributed queries, performing parallel processing of interim results on each shard server. This technique can achieve order-of-magnitude improvements in performance, in many cases 10X or more. To enable distributed queries in a seamless manner for the application, it is important to have a facility that can process a segment of the query on each individual shard, and then consolidate the results into a single result set for the application tier. Common queries that can benefit from distributed processing are:
  • Aggregation of statistics, requiring a broad sweep of data across the entire system. Such an example is the computation of sales by product, which ordinarily requires evaluation of the entire database.
  • Queries that support comprehensive reports, such as listings of all individual customers that purchased a given product in the last day, week or month.
  • Avoidance of cross-shard joins. In a sharded system, queries or other statements that use inner joins spanning shards are highly inefficient and difficult to perform. In the majority of cases, it has been found that such inner joins are not actually required by an application, so long as the correct techniques are applied. The primary technique is the replication of Global Tables, the relatively static lookup tables that are commonly utilized when joining to much larger primary tables. Tables containing values such as Status Codes, Countries, Types, and even Products fall into this category. What is required is an automated replication mechanism that ensures values for Global Tables are in sync across all shards, minimizing or eliminating the need for cross-shard joins.
  • Auto-increment key management. The typical auto-increment functionality provided by database management systems generates a sequential key for each new row inserted into the database. This is fine for a single database application, but when using Database Sharding, keys must be managed across all shards in a coordinated fashion. The requirement here is to provide a seamless, automated method of key generation to the application, one that operates across all shards, ensuring that keys are unique across the entire system.
  • Support for multiple Shard Schemes. It is important to note that Database Sharding is effective because it offers an application-specific technique for massive scalability and performance improvements. In fact, it can be said that the degree of effectiveness is directly related to how well the sharding algorithms themselves are tailored to the application problem at hand. What is required is a set of multiple, flexible shard schemes, each designed to address a specific type of application problem. Each scheme has inherent performance and/or application characteristics and advantages when applied to a specific problem domain. In fact, using the wrong shard scheme can actually inhibit performance and the very results you are trying to obtain. It is also not uncommon for a single application to use more than one shard scheme, each applied to a specific portion of the application to achieve optimum results. Here is a list of some common shard schemes:
  • Session-based sharding, where each individual user or process interacts with a specific shard for the duration of the user or process session. This is the simplest technique to implement, and adds virtually zero overhead to overall performance, since the sharding decision is made only once per session. Applications which can benefit from this approach are often customer-centric, where all data for a given customer is contained in a single shard, and that is all the data that the customer requires.
  • Transaction-based sharding determines the shard by examining the first SQL statement in a given database transaction. This is normally done by evaluating the “shard key” value used in the statement (such as an Order Number), and then directing all other statements in the transaction to the same shard.
  • Statement-based sharding is the most process-intensive of all types, evaluating each individual SQL statement to determine the appropriate shard to direct it to. Again, evaluation of the shard key value is required. This option is often desirable for high-volume, granular transactions, such as recording phone call records.
  • Determine the optimum method for sharding the data. This is another area that is highly variable, changing from application to application. It is closely tied to the selection of the Database Shard Scheme described above. There are numerous methods for deciding how to shard your data, and it is important to understand your transaction rates, table volumes, key distribution, and other characteristics of your application. This data is required to determine the optimum sharding strategy:
  • Shard by a primary key on a table. This is the most straightforward option, and the easiest to map to a given application. However, it is only effective if your data is reasonably well distributed. For example, if you elected to shard by Customer ID (and this is a sequential numeric value), and most of your transactions are for new customers, very little if anything will be gained by sharding your database. On the other hand, if you can select a key that does adequately and naturally distribute your transactions, great benefits can be realized.
  • Shard by the modulus of a key value. This option works in a vast number of cases, by applying the modulus function to the key value, and distributing transactions based on the calculated value. In essence you can predetermine any number of shards, and the modulus function effectively distributes across your shards on a “round-robin” basis, creating a very even distribution of new key values.
  • Maintain a master shard index table. This technique involves using a single master table that maps various values to specific shards. It is very flexible, and meets a wide variety of application situations. However, this option often delivers lower performance as it requires an extra lookup for each sharded SQL Statement.
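As a rough illustration of the last option, the sketch below keeps a master shard index as a simple in-memory mapping; the region keys and shard names are hypothetical, and a production system would keep the index in its own highly available store.

```python
# Master shard index: maps each shard-key value to the shard that owns it.
shard_index = {
    "EU": "shard_eu_01",
    "US-EAST": "shard_us_01",
    "US-WEST": "shard_us_02",
}

def route(shard_key):
    # The extra lookup per statement is the price of this scheme's flexibility.
    try:
        return shard_index[shard_key]
    except KeyError:
        raise LookupError(f"no shard registered for key {shard_key!r}")

print(route("EU"))                    # shard_eu_01
shard_index["APAC"] = "shard_ap_01"   # new key ranges can be added without rehashing existing data
print(route("APAC"))                  # shard_ap_01
```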

As you can see, there are many things to consider and many capabilities required in order to ensure that a Database Sharding implementation is successful and effective, delivering on its objectives of providing new levels of scalability and performance in a cost-effective manner.


Shards compared to horizontal partitioning

Horizontal partitioning splits one or more tables by row, usually within a single instance of a schema and a database server. It may offer an advantage by reducing index size (and thus search effort) provided that there is some obvious, robust, implicit way to identify in which table a particular row will be found, without first needing to search the index, e.g., the classic example of the 'Customers East' and 'Customers West' tables, where their zip code already indicates where they will be found.

Sharding goes beyond this: it partitions the problematic table(s) in the same way, but it does this across potentially multiple instances of the schema. The obvious advantage would be that search load for the large partitioned table can now be split across multiple servers (logical or physical), not just multiple indexes on the same logical server.

Splitting shards across multiple isolated instances requires more than simple horizontal partitioning. The hoped-for gains in efficiency would be lost if querying the database required both instances to be queried just to retrieve a simple dimension table. Beyond partitioning, Sharding thus splits large partitionable tables across the servers, while smaller tables are replicated as complete units.

This is also why Sharding is related to a shared nothing architecture—once sharded, each shard can live in a totally separate logical schema instance / physical database server / data center / continent. There is no ongoing need to retain shared access (from between shards) to the other unpartitioned tables in other shards.

This makes replication across multiple servers easy (simple horizontal partitioning does not). It is also useful for worldwide distribution of applications, where communications links between data centers would otherwise be a bottleneck.

There is also a requirement for some notification and replication mechanism between schema instances, so that the unpartitioned tables remain as closely synchronized as the application demands. This is a complex choice in the architecture of sharded systems: approaches range from making these effectively read-only (updates are rare and batched), to dynamically replicated tables (at the cost of reducing some of the distribution benefits of Sharding) and many options in between.


Sharding Algorithms

Distributing the traffic evenly and consistently across all the servers is not difficult if the number of servers in the cluster is constant. But in the real world, you always need to take servers out of service for maintenance. The challenge of a good sharding algorithm is to avoid complete redistribution of requests.

The table below uses a simple modular algorithm. It divides each key by the number of servers in service; the remainder identifies the server that takes the request.

[Table omitted: server assignment by key mod N for a five-node cluster, before and after a node is removed.]

You can see that if we have a 5-server (0-4) cluster and take server 4 out of service, the requests are completely redistributed to the remaining 4 servers. We are aware of two different algorithms that provide consistency upon node change.
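The effect can be reproduced with a few lines of Python; the key values below are arbitrary test data, not taken from the omitted table.

```python
keys = [650428, 650429, 650430, 650431, 650432, 650433]

def modular_shard(key, nodes_in_service):
    # Simple modular algorithm: the remainder picks the server.
    return key % nodes_in_service

before = {k: modular_shard(k, 5) for k in keys}   # 5-node cluster (servers 0-4)
after  = {k: modular_shard(k, 4) for k in keys}   # server 4 taken out of service

moved = [k for k in keys if before[k] != after[k]]
print(before)
print(after)
print(f"{len(moved)} of {len(keys)} keys changed servers")  # nearly all keys move
```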

Look-up Ring Algorithm

Form a ring using an array that has a significantly larger number of elements than the number of server nodes. For illustration purposes, we use 25 slots for 5 nodes, but the real-world ratio should be much higher; the exact number can be determined by running simulations. Then randomly place the server node numbers in this array. In order to distribute the load evenly in normal operation, the algorithm that populates the ring needs to make sure every node gets the same share of the slots.

[Figure omitted: a 25-slot look-up ring with the five server node numbers randomly distributed, five slots per node.]

To determine which node gets which request, we divide the key (650428) by the number of slots (25) and take the remainder (3). Use the remainder as an index to get the server node number (2) in the array above; that server (2) is designated to serve the request. If the designated server node is out of service (OOS), use the server node (0) designated by the next slot (4) in the array. The process continues until a server that is in service is found. The table below illustrates the process by showing which server node is selected to serve the requests for a set of test keys.

You can see in the last row that when node 2 is out of service, its load is distributed between nodes 0, 1 and 3. In the meantime, other requests continue to be served by the same server node as in the normal situation. That eliminates the need to completely redistribute the cache.

[Table omitted: node selection for a set of test keys, in normal operation and with node 2 out of service.]


The advantage of using this algorithm is that the look-up speed is fast and consistent regardless of the number of server nodes we have. The disadvantage is the need to maintain the look-up ring, especially when new nodes are being added to the cluster.
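Here is a minimal Python sketch of the look-up ring under the same 25-slot, 5-node assumptions used above. Because the ring is populated randomly, the exact slot-to-node mapping will differ from the omitted figure; the seed is fixed only to make the run reproducible.

```python
import random

NODES = 5
SLOTS = 25

random.seed(1)  # fixed seed so the example ring is reproducible
# Give every node the same share of slots, then shuffle their positions.
ring = [n for n in range(NODES) for _ in range(SLOTS // NODES)]
random.shuffle(ring)

def lookup(key, out_of_service=()):
    slot = key % SLOTS
    # If the designated node is out of service, keep walking to the next slot
    # until an in-service node is found.
    while ring[slot] in out_of_service:
        slot = (slot + 1) % SLOTS
    return ring[slot]

print(lookup(650428))                       # normal operation
print(lookup(650428, out_of_service={2}))   # node 2 out: only its keys move to neighbours
```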

Key + Node Hash Algorithm

This algorithm uses a good hash algorithm, commonly MD5 or SHA-1. For each request, compute a value for each active node. The value is the hash of a string consisting of the key and the node (node number, node name, or anything that uniquely identifies the node). The server yielding the largest hash value takes the request. The table below demonstrates the node selection process for a set of test keys. The hash algorithm used there is for illustration purposes only; it is neither MD5 nor SHA-1.

In the last row, you can see that when node 2 is out of service, its load is distributed between nodes 0, 1 and 4. In the meantime, other requests continue to be served by the same server node as in the normal situation. That eliminates the need to completely redistribute the cache.
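A minimal sketch of this key + node hash selection (essentially rendezvous/HRW hashing), assuming MD5 rather than the illustrative hash used in the omitted table:

```python
import hashlib

def pick_node(key, active_nodes):
    def score(node):
        # Hash a string made of the key and the node identifier.
        digest = hashlib.md5(f"{key}:{node}".encode()).hexdigest()
        return int(digest, 16)
    # The node yielding the largest hash value takes the request.
    return max(active_nodes, key=score)

nodes = [0, 1, 2, 3, 4]
print(pick_node(650428, nodes))                           # normal operation
print(pick_node(650428, [n for n in nodes if n != 2]))    # node 2 out of service
```

Keys whose winning node is still in service keep the same assignment; only keys that previously mapped to the removed node are reassigned.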

Support for sharding

Apache HBase

HBase supports automatic Sharding.

CUBRID

CUBRID supports Sharding from version 9.0

dbShards

CodeFutures dbShards is a product dedicated to database sharding.

Extreme Scale

Extreme Scale is a cross-process in-memory key/value data store (a variety of NoSQL data store). It uses sharding to achieve scalability across processes for both data and Map Reduce-style parallel processing.

Hibernate ORM

Hibernate Shards provides support for shards, although there has been little activity since 2007.

IBM Informix

IBM supports sharding in Informix since version 12.1 xC1 as part of the MACH11 technology. Informix 12.10 xC2 added full compatibility with Mongo DB drivers, allowing the mix of regular relational tables with NoSQL collections, still supporting sharding, failover and ACID properties.

Mongo DB

Mongo DB supports sharding from version 1.6

MySQL Cluster

Auto-Sharding: The database is automatically and transparently partitioned across low-cost commodity nodes, allowing scale-out of read and write queries, without requiring changes to the application.

Orient DB

Orient DB supports sharding from version 1.7

Plugin for Grails

Grails supports sharding using the Grails Sharding Plugin.

Ruby Active Record

Octopus works as a database sharding and replication extension for the Active Record ORM.

Scale Base’s Data Traffic Manager

Scale Base’s Data Traffic Manager is a software product dedicated to automating MySQL database sharding without requiring changes to applications.

Solr Search Server

Solr enterprise search server provides sharding capabilities.

Spanner

Spanner is Google's global scale distributed database that shards data across multiple Paxos state machines to scale to "millions of machines across hundreds of datacenters and trillions of database rows".

SQLAlchemy ORM

SQLAlchemy is an object-relational mapper for the Python programming language that provides sharding capabilities.

SQL Azure

Microsoft supported sharding in SQL Azure through "Federations". In 2014, support for the under-used Federations was dropped. Although Azure users were using sharding, they were choosing to implement it through application-specific custom approaches, rather than the automatic approach of Federations.

When Database Sharding is Appropriate

Database Sharding is an excellent fit for many types of business applications, particularly those with general-purpose database requirements. It can also be used effectively for Data Warehousing applications, but as there are many available products and technologies to accomplish this, we will not focus on that element here.

The general purpose database requirements that are a fit for sharding include:

  • High-transaction database applications
  • Mixed workload database usage
  • Frequent reads, including complex queries and joins
  • Write-intensive transactions (CRUD statements, including INSERT, UPDATE, DELETE)
  • Contention for common tables and/or rows
  • General Business Reporting
  • Typical “repeating segment” report generation
  • Some data analysis (mixed with other workloads)

To determine if Database Sharding is applicable to your specific application or environment, the most important thing to evaluate is how well your database schema lends itself to sharding. In essence, Database Sharding is a method of “horizontal” partitioning, meaning that database rows (as opposed to columns) for a single schema table are distributed across multiple shards. To understand the characteristics of how well sharding fits a given situation, here are the important things to determine:

  • Identify all transaction-intensive tables in your schema.
  • Determine the transaction volume your database is currently handling (or is expected to handle).
  • Identify all common SQL statements (SELECT, INSERT, UPDATE, DELETE), and the volumes associated with each.
  • Develop an understanding of your “table hierarchy” contained in your schema; in other words the main parent-child relationships.
  • Determine the “key distribution” for transactions on high-volume tables, to determine if they are evenly spread or are concentrated in narrow ranges.

With this information, you can rapidly gain an assessment of the value and applicability of sharding to your application. As an example, here is a simple Bookstore schema showing how the data can be sharded:


Figure 3. Example Bookstore schema showing how data is sharded.

In the Bookstore example, the Primary Shard Table is the ‘customer’ entity. This is the table that is used to shard the data. The ‘customer’ table is the parent of the shard hierarchy, with the ‘customer_order’ and ‘order_item’ entities as child tables. The data is sharded by the ‘customer.id’ attribute, and all related rows in the child tables associated with a given ‘customer.id’ are sharded as well. The Global Tables are the common lookup tables, which have relatively low activity, and these tables are replicated to all shards to avoid cross-shard joins.
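As a hedged illustration of this shard hierarchy, the sketch below routes rows by the parent 'customer.id' so that child 'customer_order' and 'order_item' rows always land on the same shard as their customer; the modulus scheme and shard count are assumptions for the example, not part of the schema itself.

```python
NUM_SHARDS = 4  # assumption for illustration only

def shard_for_customer(customer_id):
    # Primary Shard Table: 'customer' rows are placed by customer.id.
    return customer_id % NUM_SHARDS

def shard_for_order(order):
    # Child tables are sharded by the parent key, never by their own id,
    # so a customer's orders and order items live on the customer's shard.
    return shard_for_customer(order["customer_id"])

order = {"id": 9001, "customer_id": 123, "items": [{"book_id": 7, "qty": 2}]}
print(shard_for_customer(123), shard_for_order(order))  # same shard number
```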

While this example is very basic, it does provide the basic considerations for determining how to shard a given database application. By using this evaluation approach, you can determine if sharding is applicable to your particular environment, and what benefits can be realized by implementing Database Sharding.


Disadvantages of sharding

Sharding a database table before it has been optimized locally causes premature complexity. Sharding should be used only when all other options for optimization are inadequate. The introduced complexity of database sharding causes the following potential problems:

  • Increased complexity of SQL - Increased bugs because the developers have to write more complicated SQL to handle sharding logic.
  • Sharding introduces complexity - The sharding software that partitions, balances, coordinates, and ensures integrity can fail.
  • Single point of failure - Corruption of one shard due to network/hardware/systems problems causes failure of the entire table.
  • Failover servers more complex - Failover servers must themselves have copies of the fleets of database shards.
  • Backups more complex - Database backups of the individual shards must be coordinated with the backups of the other shards.
  • Operational complexity added - Adding/removing indexes, adding/deleting columns, and modifying the schema all become much more difficult.

These historical complications of do-it-yourself sharding are now being addressed by independent software vendors who provide auto-sharding solutions.

CONCLUSION

This paper has covered an overview of Database Sharding, including a discussion of the challenges involved, and a fundamental approach to implementing a sharding solution. Database Sharding has been proven in many large organizations, and may very well be applicable to your specific application problems. When implemented properly, Database Sharding can in fact deliver on the objective of cost-effective, near-linear scalability for high-volume business transaction applications.



References

Specific references:

1. Google spotlights data center inner workings | Tech news blog - CNET News.com

2. MapReduce: Simplified Data Processing on Large Clusters

3. "Our abstraction is inspired by the map and reduce primitives present in Lisp and many other functional languages." - "MapReduce: Simplified Data Processing on Large Clusters", by Jeffrey Dean and Sanjay Ghemawat; from Google Research

4. Lämmel, R. (2008). "Google's MapReduce programming model — Revisited". Science of Computer Programming 70: 1. doi:10.1016/j.scico.2007.07.001.

5. Czajkowski, Grzegorz; Marián Dvorsky; Jerry Zhao; Michael Conley. "Sorting Petabytes with MapReduce - The Next Episode". Google. Retrieved 7 April 2014.

6. Example: Count word occurrences. Research.google.com. Retrieved on 2013-09-18.

7. Ullman, J. D. (2012). "Designing good MapReduce algorithms". XRDS: Crossroads, the ACM Magazine for Students (Association for Computing Machinery) 19: 30. doi:10.1145/2331042.2331053.

8. Bosagh Zadeh, Reza; Carlsson, Gunnar. Dimension Independent Matrix Square Using MapReduce. Retrieved 12 July 2014.

9. Cheng-Tao Chu; Sang Kyun Kim, Yi-An Lin, YuanYuan Yu, Gary Bradski, Andrew Ng, and Kunle Olukotun. "Map-Reduce for Machine Learning on Multicore". NIPS 2006.

10. Ranger, C.; Raghuraman, R.; Penmetsa, A.; Bradski, G.; Kozyrakis, C. (2007). "Evaluating MapReduce for Multi-core and Multiprocessor Systems". 2007 IEEE 13th International Symposium on High Performance Computer Architecture. p. 13. doi:10.1109/HPCA.2007.346181. ISBN 1-4244-0804-0.

11. He, B.; Fang, W.; Luo, Q.; Govindaraju, N. K.; Wang, T. (2008). "Mars: a MapReduce framework on graphics processors". Proceedings of the 17th international conference on Parallel architectures and compilation techniques - PACT '08. p. 260. doi:10.1145/1454115.1454152. ISBN 9781605582825.

12. Chen, R.; Chen, H.; Zang, B. (2010). "Tiled-MapReduce: optimizing resource usages of data-parallel applications on multicore with tiling". Proceedings of the 19th international conference on Parallel architectures and compilation techniques - PACT '10. p. 523. doi:10.1145/1854273.1854337. ISBN 9781450301787.

13. Tang, B.; Moca, M.; Chevalier, S.; He, H.; Fedak, G. (2010). "Towards MapReduce for Desktop Grid Computing". 2010 International Conference on P2P, Parallel, Grid, Cloud and Internet Computing. p. 193. doi:10.1109/3PGCIC.2010.33. ISBN 978-1-4244-8538-3.

14. Lin, H.; Ma, X.; Archuleta, J.; Feng, W. C.; Gardner, M.; Zhang, Z. (2010). "MOON: MapReduce On Opportunistic eNvironments". Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing - HPDC '10. p. 95. doi:10.1145/1851476.1851489. ISBN 9781605589428.

15. Marozzo, F.; Talia, D.; Trunfio, P. (2012). "P2P-MapReduce: Parallel data processing in dynamic Cloud environments". Journal of Computer and System Sciences 78 (5): 1382. doi:10.1016/j.jcss.2011.12.021.

16. Dou, A.; Kalogeraki, V.; Gunopulos, D.; Mielikainen, T.; Tuulos, V. H. (2010). "Misco: a MapReduce framework for mobile systems". Proceedings of the 3rd International Conference on PErvasive Technologies Related to Assistive Environments - PETRA '10. p. 1. doi:10.1145/1839294.1839332. ISBN 9781450300711.

17. "How Google Works". baselinemag.com. "As of October, Google was running about 3,000 computing jobs per day through MapReduce, representing thousands of machine-days, according to a presentation by Dean. Among other things, these batch routines analyze the latest Web pages and update Google's indexes."

18. "Database Experts Jump the MapReduce Shark".

19. David DeWitt; Michael Stonebraker. "MapReduce: A major step backwards". Craig-henderson.blogspot.com. Retrieved 2008-08-27.

20. "Apache Hive - Index of - Apache Software Foundation".

21. Rubao Lee, Tian Luo, Yin Huai, Fusheng Wang, Yongqiang He and Xiaodong Zhang. "YSmart: Yet Another SQL-to-MapReduce Translator" (PDF).

22. "HBase - HBase Home - Apache Software Foundation".

23. "Bigtable: A Distributed Storage System for Structured Data" (PDF).

24. Greg Jorgensen. "Relational Database Experts Jump The MapReduce Shark". Typicalprogrammer.com. Retrieved 2009-11-11.

25. Andrew Pavlo; E. Paulson, A. Rasin, D. J. Abadi, D. J. DeWitt, S. Madden, and M. Stonebraker. "A Comparison of Approaches to Large-Scale Data Analysis". Brown University. Retrieved 2010-01-11.

26. US Patent 7,650,331: "System and method for efficient large-scale data processing"

27. Curt Monash. "More patent nonsense — Google MapReduce". Dbms2.com. Retrieved 2010-03-07.

28. "Clusterpoint XML database". clusterpoint.com.

29. "MongoDB NoSQL database". 10gen.com.

30. Zaharia, Matei; Chowdhury, Mosharaf; Franklin, Michael; Shenker, Scott; Stoica, Ion (June 2010). "Spark: Cluster Computing with Working Sets". HotCloud 2010.

General references:

  • Dean, Jeffrey & Ghemawat, Sanjay (2004). "Map Reduce: Simplified Data Processing on Large Clusters". Retrieved Nov. 23, 2011.
  • Matt Williams (2009). "Understanding Map-Reduce". Retrieved Apr. 13, 2011.
  • "What is database replication? - Definition from WhatIs.com". Searchsqlserver.techtarget.com. Retrieved 2014-01-12.
  • Mansouri, Najme; GholamHosein Dastghaibyfard; Ehsan Mansouri. "Combination of data replication and scheduling algorithm for improving data availability in Data Grids". Journal of Network and Computer Applications (2013).
  • Andronikou, K. Mamouras, K. Tserpes, D. Kyriazis, T. Varvarigou. "Dynamic QoS-aware Data Replication in Grid Environments". Elsevier Future Generation Computer Systems - The International Journal of Grid Computing and eScience, 2012.
  • Karger, D.; Lehman, E.; Leighton, T.; Panigrahy, R.; Levine, M.; Lewin, D. (1997). "Consistent Hashing and Random Trees: Distributed Caching Protocols for Relieving Hot Spots on the World Wide Web". Proceedings of the Twenty-ninth Annual ACM Symposium on Theory of Computing. ACM Press New York, NY, USA. pp. 654-663. doi:10.1145/258533.258660.
