Table Sharding Algorithms in Cloud Computing and Their Optimization
Summary
Database Sharding is an enabling technology for achieving high levels of scalability in transactional database applications. Google is credited with coining the sharding terminology, although Database Sharding has also been successfully implemented by many other high-profile online service providers such as Amazon and eBay.
Many Database Sharding implementations take advantage of open-source databases deployed on commodity multi-core server hardware, which results in the most cost-effective method of achieving database application scalability.
There are many techniques and approaches for accomplishing Database Sharding, as well as a number of factors that a successful implementation must consider.
This paper explores the various alternatives and issues, providing a foundational understanding of the important concepts vital for architects, developers and system administrators considering a Database Sharding implementation.
Keywords
1. NoSQL
A NoSQL or "Not Only SQL" database provides a mechanism for storage and retrieval of data that is modeled in means other than the tabular relations used in relational databases. Motivations for this approach include simplicity of design, horizontal scaling, and finer control over availability.
NoSQL databases are increasingly used in big data and real-time web applications. NoSQL systems are also called "Not only SQL" to emphasize that they may also support SQL-like query languages. Many NoSQL stores compromise consistency (in the sense of the CAP theorem) in favor of availability and partition tolerance. Barriers to the greater adoption of NoSQL stores include the use of low-level query languages, the lack of standardized interfaces, and huge existing investments in SQL. Most NoSQL stores lack true ACID transactions, although a few recent systems, such as FairCom c-treeACE, Google Spanner and FoundationDB, have made them central to their designs.
2. MapReduce
MapReduce is a programming model and an associated implementation for processing and generating large data sets with a parallel, distributed algorithm on a cluster.
A MapReduce program is composed of a Map() procedure that performs filtering and sorting (such as sorting students by first name into queues, one queue for each name) and a Reduce() procedure that performs a summary operation (such as counting the number of students in each queue, yielding name frequencies). The MapReduce system (also called the "infrastructure" or "framework") orchestrates the processing by marshalling the distributed servers, running the various tasks in parallel, managing all communications and data transfers between the parts of the system, and providing for redundancy and fault tolerance.
The model is inspired by the map and reduce functions commonly used in functional programming, although their purpose in the MapReduce framework is not the same as in their original forms. The key contribution of the MapReduce framework is not the map and reduce functions themselves, but the scalability and fault tolerance achieved for a variety of applications by optimizing the execution engine once. As such, a single-threaded implementation of MapReduce will usually not be faster than a traditional implementation; the model becomes beneficial only when the optimized distributed shuffle operation (which reduces network communication cost) and the framework's fault-tolerance features come into play.
MapReduce libraries have been written in many programming languages, with different levels of optimization. A popular open-source implementation is Apache Hadoop. The name MapReduce originally referred to the proprietary Google technology but has since been genericized.
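The map/shuffle/reduce pipeline described above can be sketched in a few lines. The following single-process word-count example (the classic illustration) is only a sketch of the programming model; a real framework would run the map and reduce calls in parallel across a cluster and perform the shuffle over the network:

```python
from collections import defaultdict

def map_phase(documents):
    """Map: emit a (word, 1) pair for every word in every document."""
    for doc in documents:
        for word in doc.split():
            yield (word, 1)

def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework
    would do across distributed nodes."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: summarize each group (here, count the occurrences)."""
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["the quick brown fox", "the lazy dog", "the fox"]
result = reduce_phase(shuffle(map_phase(docs)))
# "the" occurs in all three documents, "fox" in two
```

Note that in a single process this is no faster than a plain loop over the words; as the text above stresses, the value of the model appears only when the shuffle and fault tolerance are handled by a distributed execution engine.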
3. Fan-out Query
Federations provide a model for partitioning parts of your schema across multiple member databases to harness the scalability of many nodes. However, applications still need to query all of the data across federation members. Fan-out is a technique for querying data in your federation across many federation members. Fan-out queries are much like map/reduce in that they are formed of two parts:
The member query is the piece that is sent to all members involved in the query, and
The summary query is the post-processing piece that condenses the results of the member query into the desired final result-set.
With fan-out queries the member query is always present, but the summary query may not be needed. For example, if you are simply doing DML (such as data pruning or reference-data management) or DDL (such as schema deployment), fan-out requires only a member query. It is only when you need post-processing that you need the summary query.
Finally, fan-out queries are not exotic animals. The member query generates a result-set that is fed into the summary query for post-processing. This kind of staged processing is something many SQL developers already do in their stored-procedure logic, for example when staging results in temp tables or table variables, or when using CTEs (common table expressions) or views to break processing into stages. The shape of a fan-out query is a deliberate breakdown of query processing based on your federation key, but it bears great similarity to techniques built into T-SQL.
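The two-part structure can be illustrated with a small sketch. Here three in-memory SQLite databases stand in for federation members, and the table name, columns and data are hypothetical; the member query returns a partial aggregate from each member, and the summary step condenses those partials into the final result:

```python
import sqlite3

# Three in-memory SQLite databases stand in for federation members.
shards = [sqlite3.connect(":memory:") for _ in range(3)]
for i, conn in enumerate(shards):
    conn.execute("CREATE TABLE orders (customer_id INTEGER, amount REAL)")
    conn.executemany("INSERT INTO orders VALUES (?, ?)",
                     [(i * 10 + n, 100.0 * (n + 1)) for n in range(3)])

# Member query: sent to every member; each returns a partial aggregate.
member_query = "SELECT COUNT(*), SUM(amount) FROM orders"
partials = [conn.execute(member_query).fetchone() for conn in shards]

# Summary step: condense the member results into the final result-set.
# (COUNT and SUM are safe to combine this way; an AVG, by contrast, would
# have to be recomputed from the partial counts and sums.)
total_count = sum(count for count, _ in partials)
total_amount = sum(amount for _, amount in partials)
```

The summary step here runs in application code rather than as a SQL query, which is a common simplification when the post-processing is a straightforward combination of partial aggregates.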
4. Replication
Replication in computing involves sharing information so as to ensure consistency between redundant resources, such as software or hardware components, thereby improving reliability, fault tolerance, or accessibility.
A computational task is typically replicated in space, i.e. executed on separate devices, or replicated in time, if it is executed repeatedly on a single device. Replication in space or in time is often linked to scheduling algorithms.
Access to a replicated entity should be indistinguishable from access to a single, non-replicated entity: the replication itself should be transparent to the external user, and in a failure scenario the failover between replicas should be hidden as much as possible. The latter refers to data replication with respect to Quality of Service (QoS) aspects.
Computer scientists talk about active and passive replication in systems that replicate data or services:
If at any time one master replica is designated to process all the requests, then we are talking about the primary-backup scheme (master-slave scheme) predominant in high-availability clusters. On the other hand, if any replica can process a request and then distribute a new state, then this is a multi-primary scheme (called multi-master in the database field). In the multi-primary scheme, some form of distributed concurrency control must be used, such as a distributed lock manager.
Load balancing differs from task replication, since it distributes a load of different (not the same) computations across machines, and allows a single computation to be dropped in case of failure. Load balancing, however, sometimes uses data replication (especially multi-master replication) internally, to distribute its data among machines.
Backup differs from replication in that it saves a copy of data unchanged for a long period of time. Replicas, on the other hand, undergo frequent updates and quickly lose any historical state. Replication is one of the oldest and most important topics in the overall area of distributed systems.
Whether one replicates data or computation, the objective is to have some group of processes that handle incoming events. If we replicate data, these processes are passive and operate only to maintain the stored data, reply to read requests, and apply updates. When we replicate computation, the usual goal is to provide fault-tolerance. For example, a replicated service might be used to control a telephone switch, with the objective of ensuring that even if the primary controller fails, the backup can take over its functions. But the underlying needs are the same in both cases: by ensuring that the replicas see the same events in equivalent orders, they stay in consistent states and hence any replica can respond to queries.
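The core invariant named above, that replicas which see the same events in the same order stay in consistent states, can be sketched with a minimal primary-backup scheme. The class names and the key/value state model are illustrative assumptions, not drawn from any particular system:

```python
class Replica:
    """A passive replica: applies updates in order and answers reads."""
    def __init__(self):
        self.state = {}

    def apply(self, event):
        key, value = event
        self.state[key] = value

    def read(self, key):
        return self.state.get(key)

class Primary:
    """Primary-backup: the primary imposes a single order on all writes
    and forwards each event to every backup in that same order."""
    def __init__(self, backups):
        self.replica = Replica()
        self.backups = backups

    def write(self, key, value):
        event = (key, value)
        self.replica.apply(event)      # apply locally first
        for backup in self.backups:    # forward in the same order
            backup.apply(event)

backups = [Replica(), Replica()]
primary = Primary(backups)
primary.write("x", 1)
primary.write("x", 2)
# Every replica saw the same events in the same order, so any replica
# can now answer a read for "x" and give the same result.
```

A real implementation would add acknowledgements, failure detection, and failover of the primary role; the sketch shows only the ordering guarantee that makes any replica safe to read from.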
5. Consistent Hashing
Consistent hashing is a special kind of hashing such that when a hash table is resized, only K/n keys need to be remapped on average, where K is the number of keys and n is the number of slots. In contrast, in most traditional hash tables, a change in the number of array slots causes nearly all keys to be remapped.
Consistent hashing achieves the same goals as Rendezvous hashing (also called HRW Hashing). The two techniques use different algorithms, and were devised independently and contemporaneously.
Need for consistent hashing
While running collections of caching machines, some limitations are experienced. A common way of load-balancing n cache machines is to put object o in cache machine number hash(o) mod n. But this will not work if a cache machine is added or removed, because n changes and every object is hashed to a new location. This can be disastrous, since the originating content servers are flooded with requests from the cache machines. Consistent hashing is needed to avoid swamping the servers.
Consistent hashing maps objects to the same cache machine as far as possible. When a cache machine is added, it takes its share of objects from all the other cache machines, and when it is removed, its objects are shared among the remaining machines.
The main idea behind the consistent hashing algorithm is to associate each cache with one or more hash value intervals where the interval boundaries are determined by calculating the hash of each cache identifier. (The hash function used to define the intervals does not have to be the same function used to hash the cached values. Only the range of the two functions need match.) If the cache is removed its interval is taken over by a cache with an adjacent interval. All the remaining caches are unchanged.
Technique
Consistent hashing is based on mapping each object to a point on the edge of a circle (or equivalently, mapping each object to a real angle). The system maps each available machine (or other storage bucket) to many pseudo-randomly distributed points on the edge of the same circle.
To find where an object should be placed, the system finds the location of that object's key on the edge of the circle; then walks around the circle until falling into the first bucket it encounters (or equivalently, the first available bucket with a higher angle). The result is that each bucket contains all the resources located between its point and the previous bucket point.
If a bucket becomes unavailable (for example because the computer it resides on is not reachable), then the angles it maps to will be removed. Requests for resources that would have mapped to each of those points now map to the next highest point. Since each bucket is associated with many pseudo-randomly distributed points, the resources that were held by that bucket will now map to many different buckets. The items that mapped to the lost bucket must be redistributed among the remaining ones, but values mapping to other buckets will still do so and do not need to be moved.
A similar process occurs when a bucket is added. By adding a bucket point, we make any resources between that and the next smaller angle map to the new bucket. These resources will no longer be associated with the previous bucket, and any value previously stored there will not be found by the selection method described above.
The portion of the keys associated with each bucket can be altered by altering the number of angles that bucket maps to.
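The circle-mapping technique above can be condensed into a short sketch. The ring is kept as a sorted list of (point, node) pairs; the choice of MD5 as the point-generating hash, the `vnodes` count, and the node names are illustrative assumptions rather than anything prescribed by the text:

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Each node maps to many pseudo-random points on a circle; a key is
    served by the first node point found walking clockwise from its hash."""
    def __init__(self, nodes, vnodes=100):
        self.vnodes = vnodes
        self.ring = []          # sorted list of (point, node)
        for node in nodes:
            self.add(node)

    def _hash(self, key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def add(self, node):
        # Give the node `vnodes` points; more points = a larger key share.
        for i in range(self.vnodes):
            bisect.insort(self.ring, (self._hash(f"{node}#{i}"), node))

    def remove(self, node):
        self.ring = [(p, n) for p, n in self.ring if n != node]

    def lookup(self, key):
        point = self._hash(key)
        idx = bisect.bisect(self.ring, (point, ""))
        return self.ring[idx % len(self.ring)][1]   # wrap around the circle

ring = ConsistentHashRing(["cache0", "cache1", "cache2"])
before = {k: ring.lookup(k) for k in ("alpha", "beta", "gamma", "delta")}
ring.remove("cache1")
after = {k: ring.lookup(k) for k in before}
# Only keys that were on cache1 move; every other key keeps its bucket,
# because the surviving nodes' points did not change.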
6. Create, Read, Update and Delete (CRUD)
In computer programming, create, read, update and delete (the acronym CRUD, sometimes extended to SCRUD with an "S" for search) are the four basic functions of persistent storage.[1] Sometimes CRUD is expanded with retrieve instead of read, modify instead of update, or destroy instead of delete. It is also used to describe user-interface conventions that facilitate viewing, searching, and changing information, often via computer-based forms and reports. The term was likely first popularized by James Martin in his 1983 book Managing the Data-base Environment.[2][3] The acronym may be extended to CRUDL to cover listing of large data sets, which brings additional complexity such as pagination when the data sets are too large to hold easily in memory.
Another variation of CRUD is BREAD, an acronym for "Browse, Read, Edit, Add, Delete".
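The four operations map directly onto the four basic SQL statements. A minimal sketch, using an in-memory SQLite database and a hypothetical `books` table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE books (id INTEGER PRIMARY KEY, title TEXT)")

# Create -> INSERT
conn.execute("INSERT INTO books (id, title) VALUES (?, ?)", (1, "Dune"))

# Read -> SELECT
title = conn.execute("SELECT title FROM books WHERE id = ?", (1,)).fetchone()[0]

# Update -> UPDATE
conn.execute("UPDATE books SET title = ? WHERE id = ?", ("Dune Messiah", 1))

# Delete -> DELETE
conn.execute("DELETE FROM books WHERE id = ?", (1,))
remaining = conn.execute("SELECT COUNT(*) FROM books").fetchone()[0]
```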
What is Cloud computing?
Cloud computing involves distributed computing over a network, where an application may run on many connected computers at the same time. It specifically refers to a computing hardware machine or group of machines, commonly referred to as a server, connected through a communication network such as the Internet, an intranet, a local area network (LAN) or a wide area network (WAN). Any individual user who has permission to access the server can use the server's processing power to run an application, store data, or perform any other computing task. Therefore, instead of using a personal computer each time to run the application, the individual can run the application from anywhere in the world, as the server provides the processing power and is reachable over the Internet or other connection platforms [33]. All this has become possible due to the increasing computer processing power available to humankind at decreasing cost, as stated in Moore's law.
In common usage the term "the cloud" is essentially a metaphor for the Internet. Marketers have further popularized the phrase "in the cloud" to refer to software, platforms and infrastructure that are sold "as a service", i.e. remotely through the Internet. Typically, the seller has actual energy-consuming servers which host products and services from a remote location, so end-users don't have to; they can simply log on to the network without installing anything. The major models of cloud computing service are known as software as a service, platform as a service, and infrastructure as a service. These cloud services may be offered in a public, private or hybrid network. Google, Amazon, IBM, Oracle Cloud, Rackspace, Salesforce, Zoho and Microsoft Azure are some well-known cloud vendors.
Network-based services, which appear to be provided by real server hardware but are in fact served up by virtual hardware simulated by software running on one or more real machines, are often called cloud computing. Such virtual servers do not physically exist and can therefore be moved around and scaled up or down on the fly without affecting the end user, somewhat like a cloud becoming larger or smaller without being a physical object.
Sharding
Horizontal partitioning is a database design principle whereby?rows?of a database table are held separately, rather than being split into columns (which is what normalization and vertical partitioning do, to differing extents). Each partition forms part of a?shard, which may in turn be located on a separate database server or physical location.
Sharding is the process of splitting up your data so it resides in different tables or often different physical databases. Sharding is helpful when you have some specific set of data that outgrows either storage or reasonable performance within a single database.
There are numerous advantages to this partitioning approach. Since the tables are divided and distributed into multiple servers, the total number of rows in each table in each database is reduced. This reduces index size, which generally improves search performance. A database shard can be placed on separate hardware, and multiple shards can be placed on multiple machines. This enables a distribution of the database over a large number of machines, which means that the database performance can be spread out over multiple machines, greatly improving performance. In addition, if the database shard is based on some real-world segmentation of the data (e.g.,?European customers v. American customers) then it may be possible to infer the appropriate shard membership easily and automatically, and query only the relevant shard.?
In practice, Sharding is far more complex. Although it has been done for a long time by hand-coding (especially where rows have an obvious grouping, as per the example above), this is often inflexible. There is a desire to support Sharding automatically, both in terms of adding code support for it, and for identifying candidates to be sharded separately.?Consistent hashing is one form of automatic Sharding to spread large loads across multiple smaller services and servers.
Where distributed computing is used to separate load between multiple servers (either for performance or reliability reasons), a shard approach may also be useful.
The Rise of Database Sharding
The concept of Database Sharding has been gaining popularity over the past several years, due to the enormous growth in transaction volume and size of business application databases. This is particularly true for many successful online service providers, Software as a Service (SaaS) companies, and social networking Web sites.
Database Sharding can be simply defined as a "shared-nothing" partitioning scheme for large databases across a number of servers, enabling new levels of database performance and scalability. If you think of broken glass, you can get the concept of sharding: breaking your database down into smaller chunks called "shards" and spreading them across a number of distributed servers.
The term "sharding" was coined by Google engineers and popularized through their publication of the Bigtable architecture. However, the concept of "shared-nothing" database partitioning has been around for a decade or more, and there have been many implementations over this period, especially high-profile in-house solutions built by Internet leaders such as eBay, Amazon, Digg, Flickr, Skype, YouTube, Facebook, Friendster, and Wikipedia.
The focus of this paper is the importance of table sharding, available solutions and some key considerations, the options available for database partitioning, and the key considerations for a successful sharding implementation.
What Drives the Need for Database Sharding?
Database Sharding is a highly scalable approach for improving the throughput and overall performance of high-transaction, large, database-centric business applications. Since the inception of the relational database, application engineers and architects have required ever-increasing performance and capacity, based on the simple observation that business databases generally grow in size over time. Adding to this general trend is the extreme expansion of business data due to the evolution of the Internet economy, the Information Age, and the prevalence of high-volume electronic commerce.
As any experienced database administrator or application developer knows all too well, it is axiomatic that as the size and transaction volume of the database tier grow linearly, response times tend to grow logarithmically. This is shown in the following diagram:
Figure 1. The growth in database transactions and volumes has a large impact on response times.
The reasons for the performance and scalability challenges are inherent to the fundamental design of the database management systems themselves. Databases rely heavily on the three primary components of any computer: CPU, memory, and disk.
Through benchmark tests that we have performed, we know that each of these elements on a single server can only scale to a given point, after which other measures must be taken. While it is clear that disk I/O is the primary bottleneck, as database management systems have improved they have also continued to take greater advantage of CPU and memory. In fact, we have observed that it is the matching of these three factors that determines maximum performance: you cannot add an unlimited number of CPUs (or processing cores) and see a commensurate increase in performance without also improving the memory capacity and the performance of the disk subsystem. It is also common to see diminishing returns as resources are added to a single database server. These factors are especially true in mixed-use business transaction systems: systems that perform a high volume of read and write transactions while also supporting generalized business reporting tasks.
Therefore, as business applications gain sophistication and continue to grow in demand, architects, developers and database administrators have been presented with a constant challenge of maintaining database performance for mission critical systems. This landscape drives the need for Database Sharding.
Database Partitioning Options
It has long been known that database partitioning is the answer to improving the performance and scalability of relational databases, and many techniques have evolved to accomplish it.
The common drawback of each of these approaches is the reliance on shared facilities and resources. Whether relying on shared memory, centralized disk, or processor capacity, they all suffer from scalability limitations, not to mention many other drawbacks, including complex administration, lack of support for critical business requirements, and high-availability limitations.
Database Sharding, the “Shared-Nothing” Approach
Database Sharding provides a method for scalability across independent servers, each with its own CPU, memory and disk. Contrasted with other traditional methods of achieving greater database performance, it does not suffer from many of the typical limitations those approaches pose. The concept of a "shared-nothing" database implementation has been under research or discussion for 15 years or more, but it appears that the business application market is only now finding a more general need for such capability, due to the exponential increase in data volumes over the past several years.
The basic concept of Database Sharding is very straightforward: take a large database, and break it into a number of smaller databases across servers. The concept is illustrated in the following diagram:
Figure 2. Database Sharding takes large databases and breaks them down into smaller databases.
The obvious advantage of the shared-nothing Database Sharding approach is improved scalability, growing in a near-linear fashion as more servers are added to the network. However, smaller databases bring several other advantages that should not be overlooked when considering a sharding solution.
There is no doubt that Database Sharding is a viable solution for many organizations, supported by the number of large online vendors and SaaS organizations that have implemented the technology (giants such as Amazon, eBay, and of course Google).
Practicalities of Database Sharding
If Database Sharding is highly scalable, less costly, and improves performance, why hasn’t adoption of the technology been more widespread? Is it feasible for your organization?
The reality is that Database Sharding is a very useful technology, but like other approaches there are many factors to consider to ensure a successful implementation. Further, there are some limitations, and Database Sharding will not work well for every type of business application. This chapter discusses these critical considerations and how they can be addressed.
Database Sharding Challenges
Due to the distributed nature of the individual databases, a number of key elements must be taken into account.
As you can see, there are many things to consider and many capabilities required in order to ensure that a Database Sharding implementation is successful and effective, delivering on its objectives of providing new levels of scalability and performance in a cost-effective manner.
Shards compared to horizontal partitioning
Horizontal partitioning splits one or more tables by row, usually within a single instance of a schema and a database server. It may offer an advantage by reducing index size (and thus search effort) provided that there is some obvious, robust, implicit way to identify in which table a particular row will be found, without first needing to search the index, e.g., the classic example of the 'Customers East' and 'Customers West' tables, where the zip code already indicates where a row will be found.
Sharding goes beyond this: it partitions the problematic table(s) in the same way, but it does this across potentially?multiple?instances of the schema. The obvious advantage would be that search load for the large partitioned table can now be split across multiple servers (logical or physical), not just multiple indexes on the same logical server.
Splitting shards across multiple isolated instances requires more than simple horizontal partitioning. The hoped-for gains in efficiency would be lost if querying the database required both instances to be queried just to retrieve a simple dimension table. Beyond partitioning, Sharding thus splits large partitionable tables across the servers, while smaller tables are replicated as complete units.
This is also why Sharding is related to a shared-nothing architecture: once sharded, each shard can live in a totally separate logical schema instance / physical database server / data center / continent. There is no ongoing need to retain shared access (between shards) to the unpartitioned tables in other shards.
This makes replication across multiple servers easy (simple horizontal partitioning does not). It is also useful for worldwide distribution of applications, where communications links between data centers would otherwise be a bottleneck.
There is also a requirement for some notification and replication mechanism between schema instances, so that the unpartitioned tables remain as closely synchronized as the application demands. This is a complex choice in the architecture of sharded systems: approaches range from making these effectively read-only (updates are rare and batched), to dynamically replicated tables (at the cost of reducing some of the distribution benefits of Sharding) and many options in between.
Sharding Algorithms
Distributing the traffic evenly and consistently across all the servers is not difficult if the number of servers in the cluster is constant. But in the real world, you always need to take servers out of service for maintenance. The challenge for a good sharding algorithm is to avoid a complete redistribution of requests.
The table below uses a simple modular algorithm: it divides the key by the number of servers in service, and the remainder identifies the server that takes the request.
Notice that if we have a five-server (0-4) cluster and take server 4 out of service, the requests are completely redistributed across the remaining four servers. We are aware of two different algorithms that provide consistency upon node change.
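The redistribution problem with the modular algorithm is easy to demonstrate. In the sketch below (the key range is an illustrative choice), dropping from five servers to four moves the large majority of keys, which is exactly the behavior the consistent algorithms that follow are designed to avoid:

```python
def modular_shard(key, num_servers):
    """Simple modular sharding: the remainder picks the server."""
    return key % num_servers

keys = range(100)
with5 = [modular_shard(k, 5) for k in keys]   # all five servers in service
with4 = [modular_shard(k, 4) for k in keys]   # server 4 taken out of service

# Count how many keys land on a different server after the change.
moved = sum(1 for a, b in zip(with5, with4) if a != b)
# Only keys k with k % 5 == k % 4 stay put, i.e. k congruent to 0..3 mod 20,
# so the vast majority of keys are redistributed.
```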
Look-up Ring Algorithm
Form a ring using an array that has a significantly larger number of elements than server nodes. For illustration purposes, we use 25 slots for 5 nodes, but the real-world ratio should be much higher; the exact number can be determined by running simulations. Then randomly place the server node numbers in this array. In order to distribute the load evenly in normal operation, the algorithm that populates the ring needs to ensure that every node gets the same share of the slots.
To determine which node gets which request, we divide the key (650428) by the number of slots (25) and take the remainder (3). We use the remainder as an index into the array to get the server node number (2). That server (2) is designated to serve the request. If the designated server node is out of service (OOS), we use the server node (0) designated by the next slot (4) in the array. The process continues until a server that is in service is found. The table below illustrates the process by showing which server node is selected to serve the request for a set of test keys.
You can see that in the last row, when node 2 is out of service, its load is distributed among nodes 0, 1 and 3. Meanwhile, other requests continue to be served by the same server node as in the normal situation. That eliminates the need to completely redistribute the cache.
The advantage of this algorithm is that the look-up speed is fast and consistent regardless of the number of server nodes. The disadvantage is the need to maintain the look-up ring, especially when new nodes are being added to the cluster.
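The ring construction and the walk-to-next-in-service lookup can be sketched directly from the description above. The 25-slot/5-node sizing follows the illustration in the text; the seeded shuffle is an illustrative way to "randomly place" the node numbers while keeping equal shares:

```python
import random

def build_ring(num_nodes, num_slots, seed=42):
    """Populate the ring so every node gets the same share of slots,
    placed in random order (seeded here for reproducibility)."""
    assert num_slots % num_nodes == 0
    ring = [n for n in range(num_nodes) for _ in range(num_slots // num_nodes)]
    random.Random(seed).shuffle(ring)
    return ring

def lookup(ring, key, out_of_service=()):
    """Index the ring by key mod slots, then walk forward until a slot
    designates a node that is in service."""
    idx = key % len(ring)
    for step in range(len(ring)):
        node = ring[(idx + step) % len(ring)]
        if node not in out_of_service:
            return node
    raise RuntimeError("no node in service")

ring = build_ring(num_nodes=5, num_slots=25)
normal = lookup(ring, 650428)
failover = lookup(ring, 650428, out_of_service={normal})
# The failed node's requests fall through to the next in-service slot;
# keys that mapped to other nodes are untouched.
```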
Key + Node Hash Algorithm
This algorithm uses a good hash algorithm, commonly MD5 or SHA-1. For each request, compute a value for each active node: the hash of a string consisting of the key and the node (node number, node name, or anything that uniquely identifies the node). The server yielding the largest hash value takes the request. The table below demonstrates the node-selection process for a set of test keys. The hash algorithm used there is for illustration purposes only; it is neither MD5 nor SHA-1.
In the last row, you can see that when node 2 is out of service, its load is distributed among nodes 0, 1 and 4. Meanwhile, other requests continue to be served by the same server node as in the normal situation. That eliminates the need to completely redistribute the cache.
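This key+node scheme is also known as rendezvous (highest-random-weight) hashing, mentioned earlier alongside consistent hashing. A minimal sketch follows; MD5 is used here as the text suggests, and the node names and key range are illustrative. The consistency property falls out of the definition: removing a node only reassigns the keys for which that node held the largest hash, since the maximum over the surviving nodes is unchanged for every other key:

```python
import hashlib

def node_for_key(key, nodes):
    """Key + node hash: the node with the largest hash(key, node) wins."""
    def score(node):
        return int(hashlib.md5(f"{key}:{node}".encode()).hexdigest(), 16)
    return max(nodes, key=score)

nodes = ["node0", "node1", "node2", "node3", "node4"]
assignments = {k: node_for_key(k, nodes) for k in range(20)}

# Take node2 out of service: only its keys move; all others stay put.
survivors = [n for n in nodes if n != "node2"]
reassigned = {k: node_for_key(k, survivors) for k in range(20)}
```

Compared with the look-up ring, no shared data structure has to be maintained, at the cost of computing one hash per active node for every request.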
Support for shard
Apache HBase
HBase supports automatic Sharding.?
CUBRID
CUBRID supports Sharding from version 9.0.
dbShards
CodeFutures dbShards is a product dedicated to database shards.
Extreme Scale
Extreme Scale is a cross-process in-memory key/value data store (a variety of NoSQL data store). It uses sharding to achieve scalability across processes for both data and Map Reduce-style parallel processing.?
Hibernate ORM
Hibernate Shards provides support for shards, although there has been little activity since 2007.?
IBM Informix
IBM supports sharding in Informix since version 12.1 xC1 as part of the MACH11 technology. Informix 12.10 xC2 added full compatibility with Mongo DB drivers, allowing the mix of regular relational tables with NoSQL collections, still supporting sharding, failover and ACID properties.?
MongoDB
MongoDB supports sharding from version 1.6.
MySQL Cluster
Auto-Sharding: Database is automatically and transparently partitioned across low cost commodity nodes, allowing scale-out of read and write queries, without requiring changes to the application.?
OrientDB
OrientDB supports sharding from version 1.7.
Plugin for Grails
Grails supports sharding using the Grails Sharding Plugin.?
Ruby Active Record
Octopus works as a database sharding and replication extension for the Active Record ORM.
Scale Base’s Data Traffic Manager
Scale Base’s Data Traffic Manager is a software product dedicated to automating MySQL database sharding without requiring changes to applications.?
Solr Search Server
Solr enterprise search server provides sharding capabilities.
Spanner
Spanner is Google's global scale distributed database that shards data across multiple Paxos state machines to scale to "millions of machines across hundreds of datacenters and trillions of database rows".
SQLAlchemy ORM
SQLAlchemy is an object-relational mapper for the Python programming language that provides sharding capabilities.?
SQL Azure
Microsoft supported sharding in SQL Azure through "Federations". In 2014, support for the under-used Federations was dropped; although Azure users were sharding, they chose to implement it through application-specific custom approaches rather than the automatic approach of Federations.
When Database Sharding is Appropriate
Database Sharding is an excellent fit for many types of business applications with general-purpose database requirements. It can also be used effectively for Data Warehousing applications, but since there are many available products and technologies to accomplish this, we will not focus on that element here.
The general purpose database requirements that are a fit for sharding include:
To determine if Database Sharding is applicable to your specific application or environment, the most important thing to evaluate is how well your database schema lends itself to sharding. In essence, Database Sharding is a method of “horizontal” partitioning, meaning that database rows (as opposed to columns) for a single schema table are distributed across multiple shards. To understand how well sharding fits a given situation, here are the important things to determine:
With this information, you can rapidly gain an assessment of the value and applicability of sharding to your application. As an example, here is a simple Bookstore schema showing how the data can be sharded:
Figure 3. Example Bookstore schema showing how data is sharded.
In the Bookstore example, the Primary Shard Table is the ‘customer’ entity. This is the table that is used to shard the data. The ‘customer’ table is the parent of the shard hierarchy, with the ‘customer_order’ and ‘order_item’ entities as child tables. The data is sharded by the ‘customer.id’ attribute, and all related rows in the child tables associated with a given ‘customer.id’ are sharded as well. The Global Tables are the common lookup tables, which have relatively low activity, and these tables are replicated to all shards to avoid cross-shard joins.
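The routing logic described above can be sketched in code. This is an illustrative sketch only, not taken from the paper: the shard count, hashing scheme (simple modulo on ‘customer.id’), and helper names are assumptions chosen for clarity.

```python
# Hypothetical sketch: routing rows of the Bookstore schema to shards
# by the 'customer.id' shard key. NUM_SHARDS and all names are
# illustrative assumptions, not a specific product's API.

NUM_SHARDS = 4  # assumed number of shards for this example

def shard_for(customer_id: int) -> int:
    """Map a customer.id to a shard index via simple modulo hashing."""
    return customer_id % NUM_SHARDS

def route_order_item(order_item: dict, orders: dict) -> int:
    """Route an 'order_item' row via its parent order's customer_id,
    so all rows in the customer -> customer_order -> order_item
    hierarchy land on the same shard (no cross-shard joins)."""
    customer_id = orders[order_item["order_id"]]["customer_id"]
    return shard_for(customer_id)

orders = {101: {"customer_id": 7}}
item = {"order_id": 101, "product_id": 55}
# The child row is co-located with its parent customer's shard.
assert route_order_item(item, orders) == shard_for(7)
```

Because child rows are always routed through their parent customer key, queries scoped to a single customer touch exactly one shard; only the replicated Global Tables are needed locally to complete lookups.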
While this example is simple, it does illustrate the key considerations for determining how to shard a given database application. By using this evaluation approach, you can determine whether sharding is applicable to your particular environment, and what benefits can be realized by implementing Database Sharding.
Disadvantages of sharding
Sharding a database table before it has been optimized locally causes premature complexity. Sharding should be used only when all other options for optimization are inadequate. The introduced complexity of database sharding causes the following potential problems:
These historical complications of do-it-yourself sharding are now being addressed by independent software vendors who provide auto-sharding solutions.
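One concrete form of the complexity introduced by sharding is that aggregate queries which were a single SQL statement now require scatter-gather logic in the application. The sketch below is a hypothetical illustration under assumed names; the per-shard callables stand in for real database connections.

```python
# Illustrative sketch of application-level scatter-gather: an aggregate
# that was one "SELECT COUNT(*)" now fans out to every shard and the
# application must combine the partial results itself.

from typing import Callable, List

def total_orders(shards: List[Callable[[], int]]) -> int:
    """Sum a per-shard row count across all shards (scatter-gather)."""
    return sum(count_on_shard() for count_on_shard in shards)

# Each lambda stands in for running "SELECT COUNT(*) FROM customer_order"
# against one shard's connection.
shard_counts = [lambda: 120, lambda: 98, lambda: 154, lambda: 87]
assert total_orders(shard_counts) == 459
```

Auto-sharding products hide this fan-out behind a query layer, which is precisely the do-it-yourself burden the vendors mentioned above aim to remove.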
CONCLUSION
This paper has covered an overview of Database Sharding, including a discussion of the challenges involved, and a fundamental approach to implementing a sharding solution. Database Sharding has been proven in many large organizations, and may very well be applicable to your specific application problems. When implemented properly, Database Sharding can in fact deliver on the objective of cost-effective, near-linear scalability for high-volume business transaction applications.