Amazon DynamoDB — How it Reads/Writes Data Under the Hood
Asim Hafeez
Senior Software Engineer | Lead | AI | LLMs | System Design | Blockchain | AWS
What is DynamoDB?
DynamoDB, offered by Amazon Web Services (AWS), is a fully managed NoSQL database that delivers fast, consistent performance and scales with little operational effort. Its flexible data model lets you store and retrieve data in both document and key-value structures. Security is a top priority and includes network isolation. Its low-latency data access makes it a popular choice for applications that need to perform well at scale.
We are going to look at how DynamoDB stores and retrieves data under the hood. To do so, we will walk through two of the requests DynamoDB offers: GET and PUT.
Later on, we will also see how a table with multiple rows of data is stored in DynamoDB.
GET Request
When an application requests data from DynamoDB, the request travels over the network regardless of where it originates, whether that is a VPC, a public network, or an EC2 instance. DynamoDB handles the request the same way whatever its source and returns the requested data.
After passing through the network, the request reaches a stateless component known as the Request Router. It does not matter which Request Router handles the request, because they are interchangeable. The Request Router's first step is to verify the requester's identity and permissions through the Authentication Service. If the requester is authenticated and authorized, the request continues; otherwise, the router returns an error indicating that the requester is unauthenticated or unauthorized.
The Authentication Service utilized by DynamoDB is the same used throughout AWS. It involves a policy written in JSON format that specifies what actions are allowed and what are not for the requester.
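As an illustration, a policy of this kind might look like the following, shown as a Python dictionary that serializes to the JSON form. The table name `Users`, the region, and the account ID are placeholders rather than values from this article, and `is_allowed` is a hypothetical toy check, not how AWS evaluates policies. The action names `dynamodb:GetItem` and `dynamodb:PutItem` are the standard IAM actions for the two requests discussed here.

```python
import json

# Hypothetical policy: allows reads and writes on a single table.
# The ARN (region, account ID, table name) is a placeholder.
TABLE_ARN = "arn:aws:dynamodb:us-east-1:123456789012:table/Users"

policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["dynamodb:GetItem", "dynamodb:PutItem"],
            "Resource": TABLE_ARN,
        }
    ],
}


def is_allowed(policy: dict, action: str, resource: str) -> bool:
    """Toy check: exact match on action and resource.

    Real IAM evaluation also handles wildcards, conditions,
    and explicit denies; none of that is modeled here.
    """
    for statement in policy["Statement"]:
        if (
            statement["Effect"] == "Allow"
            and action in statement["Action"]
            and resource == statement["Resource"]
        ):
            return True
    return False


if __name__ == "__main__":
    print(json.dumps(policy, indent=2))
    print(is_allowed(policy, "dynamodb:GetItem", TABLE_ARN))
    print(is_allowed(policy, "dynamodb:DeleteItem", TABLE_ARN))
```

With this policy, a GET or PUT against the `Users` table passes the check, while any action that is not listed, such as a delete, is rejected.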
After successful authentication and authorization, the Request Router is ready to send the request to the storage nodes where the data lives. Before it does, it consults another service connected to it, the Partition Metadata System.
The Partition Metadata System holds information about each partition, including which node is its leader. Each availability zone contains several storage nodes, a topic discussed later. To decide which storage node is the leader, DynamoDB uses the Paxos algorithm to run an election among them.
The Request Router forwards the request to one of the storage nodes to balance the load and returns the requested data to the application.
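The GET path described so far can be sketched in a few lines of Python. This is a simplified model under stated assumptions, not DynamoDB's actual implementation: the class names, the set of allowed callers standing in for the Authentication Service, and the dictionary standing in for the Partition Metadata System are all hypothetical, and each table is assumed to have a single partition.

```python
import random
from dataclasses import dataclass, field


@dataclass
class StorageNode:
    name: str
    data: dict = field(default_factory=dict)


@dataclass
class Partition:
    leader: StorageNode
    followers: list


class RequestRouter:
    """Stateless: keeps nothing between requests, so any router will do."""

    def __init__(self, allowed_callers, partition_metadata):
        # Stand-in for the Authentication Service (assumption).
        self.allowed_callers = allowed_callers
        # Stand-in for the Partition Metadata System (assumption).
        self.partition_metadata = partition_metadata

    def get(self, caller, table, key):
        # Step 1: reject callers that fail the auth check.
        if caller not in self.allowed_callers:
            raise PermissionError(f"{caller} is not authorized")
        # Step 2: look up which nodes hold the table's partition.
        partition = self.partition_metadata[table]
        # Step 3: pick any replica to spread the read load.
        node = random.choice([partition.leader, *partition.followers])
        return node.data.get(key)


if __name__ == "__main__":
    nodes = [StorageNode(f"node-{i}", {"user#1": "Alice"}) for i in range(3)]
    metadata = {"Users": Partition(leader=nodes[0], followers=nodes[1:])}
    router = RequestRouter({"my-app"}, metadata)
    print(router.get("my-app", "Users", "user#1"))
```

Because the router holds only references to shared services and no per-request state, a second `RequestRouter` built from the same inputs would answer the same request identically.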
Because data is replicated across several storage nodes, a default read is not guaranteed to return the most recent write. The Request Router may route the request to a node that has not yet received the latest update, so the response can be out of date. This is why DynamoDB's default reads are called Eventually Consistent Reads (a strongly consistent read can be requested instead). In practice, stale reads are uncommon, and in most cases the data returned is current.
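A small simulation can make the difference concrete. The sketch below assumes three replicas, one of which has not yet received the latest write; the `Replica` class and `read` function are hypothetical names for illustration. The `consistent_read` flag mirrors the idea behind the `ConsistentRead` parameter of DynamoDB's GetItem API, but the routing logic here is only an assumption about how such a flag could behave.

```python
import random


class Replica:
    def __init__(self, name):
        self.name = name
        self.data = {}


def read(leader, followers, key, consistent_read=False, rng=random):
    """Strong read goes to the leader; a default read may hit any replica."""
    if consistent_read:
        node = leader
    else:
        node = rng.choice([leader, *followers])
    return node.data.get(key)


leader, f1, f2 = Replica("leader"), Replica("follower-1"), Replica("follower-2")

# The old value "v1" has been replicated everywhere.
for replica in (leader, f1, f2):
    replica.data["user#1"] = "v1"

# A new write "v2" has reached the leader and one follower, but not f2 yet.
leader.data["user#1"] = "v2"
f1.data["user#1"] = "v2"

if __name__ == "__main__":
    print(read(leader, [f1, f2], "user#1", consistent_read=True))
    print({read(leader, [f1, f2], "user#1") for _ in range(50)})
```

The strongly consistent read always sees the latest value, while the default read returns either value depending on which replica it lands on, until the lagging follower catches up.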
PUT Request
When an application wants to store data in DynamoDB, the process is similar to a GET request, but with some differences at the end. The request is sent through a network, regardless of whether it is a public network, VPC, or EC2, and it reaches the Request Router. As previously discussed, the Request Router then forwards the request to the Authentication Service for authentication and authorization.
DynamoDB increases durability by replicating each write to two additional storage nodes. To keep latency low, it returns a success response as soon as the write has been confirmed on two of the three nodes, rather than making the application wait for every replica.
For a storage node to become a leader, it must hold all the latest updates and modifications. To perform a conditional put, the leader must know the current value to compare against. Every PUT request is first directed to the leader node, which stores the data and then replicates it to its peer nodes.
Each storage node maintains the heartbeats of the other nodes. If the heartbeat of a storage node stops, it is assumed that the node has gone down. The remaining nodes then determine who will become the new leader by evaluating each other against the necessary criteria, and the node that satisfies the requirements becomes the new leader.
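The heartbeat check and the choice of a replacement leader can be sketched as follows. This is a deliberately simplified stand-in and not the Paxos algorithm itself: the timeout value, the `Node` fields, the majority requirement, and the rule "the live node with the most recent applied write wins" are all assumptions made for illustration.

```python
from dataclasses import dataclass

HEARTBEAT_TIMEOUT = 3.0  # seconds; hypothetical value


@dataclass
class Node:
    name: str
    last_heartbeat: float  # time the last heartbeat from this node was seen
    last_applied_seq: int  # sequence number of the latest write it applied


def is_alive(node: Node, now: float) -> bool:
    """A node is presumed down once its heartbeat is older than the timeout."""
    return now - node.last_heartbeat <= HEARTBEAT_TIMEOUT


def elect_leader(nodes, now):
    """Pick the live node with the most up-to-date data.

    Returns None when fewer than a majority of nodes are alive.
    Ties on sequence number are broken by name so the result is deterministic.
    """
    alive = [n for n in nodes if is_alive(n, now)]
    if len(alive) <= len(nodes) // 2:
        return None
    return max(alive, key=lambda n: (n.last_applied_seq, n.name))


if __name__ == "__main__":
    nodes = [
        Node("a", last_heartbeat=0.0, last_applied_seq=10),  # old leader, silent
        Node("b", last_heartbeat=9.5, last_applied_seq=9),
        Node("c", last_heartbeat=9.8, last_applied_seq=10),
    ]
    new_leader = elect_leader(nodes, now=10.0)
    print(new_leader.name if new_leader else "no leader")
```

In this example node `a` has been silent for longer than the timeout, so the two remaining nodes form a majority, and `c` is chosen because it has applied more writes than `b`.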
To achieve maximum availability, data must be stored in multiple availability zones. Each availability zone has numerous request routers and storage nodes to handle incoming requests. When a request travels through the network, it is first directed to the nearest availability zone and then to a random request router. Because request routers are stateless, it does not matter which one the request reaches.
After reaching a request router, the request is redirected to the leader storage node. Once the data is stored in the leader, it is then replicated to the other peer nodes in different availability zones through asynchronous connections. When the data is stored on at least two nodes, including the leader, a successful response is sent back to the application, confirming that the data has been stored in DynamoDB.
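The acknowledgment rule described above, leader first and success once two nodes hold the data, can be modeled with a short sketch. The class and function names are hypothetical, replication is done with plain method calls rather than real network connections, and there is no rollback or background catch-up; the code only shows when a success response would be returned.

```python
class StorageNode:
    def __init__(self, name, reachable=True):
        self.name = name
        self.reachable = reachable
        self.data = {}

    def store(self, key, value):
        """Return True if the write was stored, False if the node is down."""
        if not self.reachable:
            return False
        self.data[key] = value
        return True


def put(leader, peers, key, value, required_acks=2):
    """Write to the leader, then to peers, until enough nodes confirm."""
    if not leader.store(key, value):
        return False  # without the leader, the write cannot proceed
    acks = 1  # the leader counts as the first confirmation
    for peer in peers:
        if acks >= required_acks:
            # Enough copies exist; a real system would keep replicating
            # to the remaining peer in the background (not modeled here).
            break
        if peer.store(key, value):
            acks += 1
    return acks >= required_acks


if __name__ == "__main__":
    leader = StorageNode("leader")
    peers = [StorageNode("peer-1"), StorageNode("peer-2")]
    print(put(leader, peers, "user#1", "Alice"))
    print([(node.name, node.data) for node in [leader, *peers]])
```

After the call, the leader and the first peer hold the item while the second peer does not, which is exactly the window in which an eventually consistent read could return older data.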
Let’s take a closer look at how the DynamoDB table is stored in these storage nodes.
Table
Imagine we have a table containing multiple rows of user information. Let’s examine how this data is distributed among the various storage nodes.
DynamoDB applies an internal, undisclosed hash function to the partition key portion of each item's primary key to produce a hash value. The key property of this hash function is that it is deterministic: the same key always produces the same hash value.
Once the hash of each primary key has been generated, DynamoDB organizes the data based on the hashes and assigns each hash to a specific partition for storage.
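The mapping from key to partition can be illustrated with a short sketch. DynamoDB's real hash function is not public, so SHA-256 is used here purely as a stand-in, and the number of partitions and the equal-range split of the hash space are assumptions chosen for the example.

```python
import hashlib

NUM_PARTITIONS = 4  # hypothetical partition count
HASH_SPACE = 2**64  # the sketch uses a 64-bit hash space


def key_hash(partition_key: str) -> int:
    """Deterministic 64-bit hash of the key (SHA-256 is a stand-in)."""
    digest = hashlib.sha256(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def partition_for(partition_key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Split the hash space into equal contiguous ranges, one per partition."""
    range_size = HASH_SPACE // num_partitions
    # min() keeps the top of the hash space inside the last partition
    # when the space does not divide evenly.
    return min(key_hash(partition_key) // range_size, num_partitions - 1)


if __name__ == "__main__":
    for user_id in ["user#1", "user#2", "user#3", "user#4"]:
        print(user_id, "-> partition", partition_for(user_id))
```

Because the hash depends only on the key, a later GET for the same key is routed to the same partition that stored it, without any lookup table of individual items.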
DynamoDB creates partitions for the table and distributes them to storage nodes in each availability zone. The leader storage node is selected by the Paxos algorithm, which runs among the storage nodes.
As we have discussed earlier, in DynamoDB, the data is stored across multiple storage nodes, leading to the possibility of an Eventually Consistent Read. This occurs when a GET request is sent to a node that has not yet been updated, resulting in inconsistent data being returned. However, this is a rare occurrence.
A PUT request is considered complete when data has been successfully stored in two of the nodes, thereby improving both latency and durability.
Summary
In this article, the process of GET and PUT requests was explored. When a GET request is made, it is sent via a network to a request router. After authentication and authorization, it is directed to a storage node, where the requested data is retrieved and returned to the application. The process for PUT requests is similar, but it is important to know the leader storage node, as the data must first be stored there before being replicated to peer nodes. Once the data has been stored on two nodes, the PUT request is considered complete. We also looked at the storage of tables in DynamoDB.
If this article provided insight into the inner workings of DynamoDB and was helpful to you, please consider giving it a clap.