Partition and Replication

The manner in which a data set is distributed across multiple nodes is very important. In order for any computation to happen, we first need to locate the data and then act on it.

There are two basic techniques that can be applied to a data set. It can be split over multiple nodes (partitioning) to allow for more parallel processing. It can also be copied or cached on different nodes to reduce the distance between the client and the server and for greater fault tolerance (replication).

The difference between the two is straightforward: partitioned data is divided into independent sets, while replicated data is copied in full to multiple locations.

This is the one-two punch for solving any problem where distributed computing plays a role. Of course, the trick is in picking the right technique for your concrete implementation; there are many algorithms that implement replication and partitioning, each with different limitations and advantages which need to be assessed against your design objectives.

Partitioning

Partitioning means dividing the data set into smaller, distinct, independent sets; this reduces the impact of data set growth, since each partition is only a subset of the data.

  • Partitioning improves performance by limiting the amount of data to be examined and by locating related data in the same partition
  • Partitioning improves availability by allowing partitions to fail independently, increasing the number of nodes that need to fail before availability is sacrificed

Partitioning is also very much application-specific, so it is hard to say much about it without knowing the specifics. That's why the focus is on replication in most texts, including this one.

Partitioning is mostly about defining your partitions based on what you think the primary access pattern will be, and dealing with the limitations that come from having independent partitions (e.g. inefficient access across partitions, different rates of growth, etc.). A minimal sketch of key-based partitioning is shown below.
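To make this concrete, here is a minimal Python sketch of key-based (hash) partitioning. The PartitionedStore class and the number of partitions are hypothetical illustrations, not from the original article: keys are routed to a partition by a stable hash, so related data (the same key) always lands on the same partition and a lookup only has to touch one node.

```python
# Minimal sketch of hash-based (key) partitioning - illustrative only.
import hashlib

class PartitionedStore:
    def __init__(self, num_partitions):
        # each partition is an independent subset of the data
        self.partitions = [dict() for _ in range(num_partitions)]

    def _partition_for(self, key):
        # stable hash so the same key always maps to the same partition
        digest = hashlib.md5(key.encode()).hexdigest()
        return int(digest, 16) % len(self.partitions)

    def put(self, key, value):
        self.partitions[self._partition_for(key)][key] = value

    def get(self, key):
        return self.partitions[self._partition_for(key)].get(key)

store = PartitionedStore(num_partitions=4)
store.put("user:42", {"name": "Ada"})
print(store.get("user:42"))  # only one partition is examined for this lookup
```

Note the trade-off mentioned above: a query that spans keys in different partitions would have to touch several partitions, which is exactly the kind of cross-partition access that becomes inefficient.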

Replication

Replication is making copies of the same data on multiple machines; this allows more servers to take part in the computation.

Replication - copying or reproducing something - is the primary way in which we can fight latency.

  • Replication improves performance by making additional computing power and bandwidth applicable to a new copy of the data
  • Replication improves availability by creating additional copies of the data, increasing the number of nodes that need to fail before availability is sacrificed

Replication is about providing extra bandwidth, and caching where it counts. It is also about maintaining consistency in some way according to some consistency model.

Replication allows us to achieve scalability, performance and fault tolerance. Afraid of loss of availability or reduced performance? Replicate the data to avoid a bottleneck or single point of failure. Slow computation? Replicate the computation on multiple systems. Slow I/O? Replicate the data to a local cache to reduce latency or onto multiple machines to increase throughput.
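As a rough illustration, here is a minimal Python sketch of synchronous, full-copy replication. The ReplicatedStore and Replica classes and the node names are hypothetical, not a real library: every write is applied to all copies before returning, so any replica can serve a read and losing a single node does not lose data.

```python
# Minimal sketch of synchronous full-copy replication - illustrative only.
import random

class Replica:
    def __init__(self, name):
        self.name = name
        self.data = {}

class ReplicatedStore:
    def __init__(self, replica_names):
        self.replicas = [Replica(n) for n in replica_names]

    def write(self, key, value):
        # synchronous replication: apply the write to every copy before returning
        for replica in self.replicas:
            replica.data[key] = value

    def read(self, key):
        # any replica can answer, spreading read load and surviving node loss
        replica = random.choice(self.replicas)
        return replica.data.get(key)

store = ReplicatedStore(["node-a", "node-b", "node-c"])
store.write("config:theme", "dark")
print(store.read("config:theme"))  # served by whichever replica was picked
```

Waiting for every copy on each write is what keeps the replicas in sync here; relaxing that wait is where the consistency questions discussed next come from.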

Replication is also the source of many problems, since there are now independent copies of the data that have to be kept in sync on multiple machines - this means ensuring that the replication follows a consistency model.

The choice of a consistency model is crucial: a good consistency model provides clean semantics for programmers (in other words, the properties it guarantees are easy to reason about) and meets business/design goals such as high availability or strong consistency.

Only one consistency model for replication - strong consistency - allows you to program as if the underlying data were not replicated. Other consistency models expose some internals of the replication to the programmer. However, weaker consistency models can provide lower latency and higher availability - and are not necessarily harder to understand, just different.
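The sketch below (again with made-up, hypothetical class names) shows the programmer-visible difference: with asynchronous replication, a read served by a lagging replica can return stale data, which is exactly the kind of internal detail that a weaker consistency model exposes and strong consistency hides.

```python
# Minimal sketch of asynchronous (eventually consistent) replication - illustrative only.
class AsyncReplicatedStore:
    def __init__(self):
        self.primary = {}
        self.replica = {}      # lags behind until replicate() runs
        self.pending = []

    def write(self, key, value):
        # acknowledge as soon as the primary has the write (low latency)
        self.primary[key] = value
        self.pending.append((key, value))

    def replicate(self):
        # replication happens later, in the background
        for key, value in self.pending:
            self.replica[key] = value
        self.pending.clear()

    def read_from_replica(self, key):
        return self.replica.get(key)

store = AsyncReplicatedStore()
store.write("balance", 100)
print(store.read_from_replica("balance"))  # None: the replica is not yet in sync
store.replicate()
print(store.read_from_replica("balance"))  # 100: eventually consistent
```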

I hope you found the article useful.

Happy Coding :)
