Version Vector(II)

Version Vector(II)

In my last article, we saw how Distributed Data Stores use Version Vector to identify concurrent updates to data records. We looked at one of the techniques of identifying concurrent updates/conflicts by leveraging ClientId as an Actor & the advantages and disadvantages of doing so. In this article, we’ll look at another approach for identifying concurrent updates/conflicts.

Server As An?Actor

The problem with Server as an Actor is that of Actor Explosion, as the number of clients can grow to a very high number. To solve that, we can leverage servers as actors.

But, you can ask, we can have very large clusters as well, across multiple regions and that might face the same problem of Actor Explosion.?

Yes, You’re right! Hence, we define servers as the number of nodes defined by the replication factor. If you remember, Each data record is tied to Version Vectors & hence for each data record, the maximum size of the version vectors will be the replication factor for the data in that cluster.

No alt text provided for this image

Let’s try to understand what’s happening in the above diagram -

  1. Let’s assume we have a key K, with value U. We’re assuming that we have an empty version vector, to begin with. Client’s C2 and C3 sync the same state from the Replica(Assuming all clients are interacting with the same replica) that’s implementing Version Vectors.
  2. C2 updates the value to W & sends a PUT command, with the local state of Version Vector it has(empty VV).
  3. C3 updates the value to V & sends a PUT command, with the local state of Version Vector it has(empty VV).
  4. Replica A receives the request from C3 first(C2’s request might be delayed because of network latency). Replica A compares the Version Vector it received with its local state & sees that they match. So it increments the counter to 1 & updates the value to V.
  5. The request from C2 finally arrives at Replica A. Replica A compares the Version Vector it received with its local state & sees that the vector it received does not match it’s state. The following approaches can be taken —

  • Replica A can ignore the request from C2, as its local version vector is higher than the incoming version vector from C2.
  • Replica A can update the value to W and counter to 2. This way, we end up serializing requests from C2 & C3 and applying Last-Write-Wins strategy.
  • Replica A can store both the values V & W as siblings & increment the counter to 2. There is again a risk of Siblings' explosions here, and unlike the last two options, we’re leaving the conflict resolution between the two values to be done later!?

Advantage -

  • Does not suffer from Actor Explosion, like the Client As An Actor approach.

Concerns -

  • Potential missing data if sequential write requests where the local version vector is higher, can be ignored.
  • The potential issue of Sibling Explosion if siblings are being stored. Also, as siblings grow in size, there would be performance issues during reconciliation.
  • If siblings are being stored, no way to track causality in the merged state. Eg — K?: {(A, 2)}?: W, V cannot tell me which update V/W happened before which.

This brings us to the end of this article. Server As An Actor is definitely a promising approach to avoid the explosion problem, but the server’s still a proxy to the clients which are actually performing the operations. Hence, it also suffers from issues based on different approaches, where without siblings, we can have data loss/updates and with siblings, we don’t have a way to track causality in the merged state.

In our next article, we’ll cover the best approach to solving this problem. So stay tuned!

Thank you for reading! I’ll be posting weekly content on distributed systems & patterns, so please like, share and subscribe to this?newsletter?for notifications of new posts.

Please comment on the post with your feedback, will help me improve! :)

Until next time, Keep asking questions & Keep learning!



要查看或添加评论,请登录

Pratik Pandey的更多文章

社区洞察

其他会员也浏览了