How The Single Table Design Can Help You Reduce Latency And Costs in DynamoDB
Uriel Bitton
AWS Cloud Engineer | The DynamoDB guy | AWS Certified & AWS Community Builder | I help you build scalable DynamoDB databases
DynamoDB is a NoSQL key-value datastore capable of low latency and high throughput at any scale.
It has a reputation for being infinitely scalable.
But at such a high scale, costs can very quickly become overwhelming.
As your business needs grow, and your application's user base scales, so will your data and the number of tables in your database.
In DynamoDB you pay per table, and consequently, the more tables you have, the more expensive your database becomes.
The solution to this cost issue came in the form of the single table design, a database design pattern invented by Rick Houlihan when he worked at AWS.
The single table design is a design pattern that offers two very measurable benefits: lower latency and lower costs.
In this article, I'll demonstrate how we can implement a database that will support the single table design and how it will provide real value to save on costs and achieve low latency at any scale.
Let's dive right into it.
The Single Table Design
Perhaps the biggest criticism of the single table design is its learning curve. Most database developers come from a relational database background, where each entity naturally gets its own normalized table.
However, in DynamoDB, since you pay per table, the more tables you have, the higher your costs are.
The motivation behind the single table design is this quote from Rick Houlihan:
Data that is accessed together should be stored together.
The question now is how the single table design reduces cost and helps us achieve lower latency at scale.
Let's demonstrate with an example.
Say we are building an online social media platform.
We have the following entities in our database: users, posts, comments, and likes.
In a relational database, we would have 4 tables for the 4 entities.
However, in our single table design, we will store all of these entities and their data in one table.
Before we design our database for the single table design, one critical step with this design pattern is to first identify our data access patterns.
Let's take a look at what our principal access patterns may be:
1. Get a post by its postId
2. Get all comments of a post
3. Get all likes of a post
4. Get all likes of a specific comment
These are very typical access patterns of most social media or blogging applications.
The focus of this application is on Posts. Consequently, it's a good idea to use postIds as the partition key for posts, comments, and likes items.
Let's create a "post" item that contains fields like a title, content, and postDate.
It's also standard practice to have a field called "entityType" which records the type of item it is - since we are mixing many items in one table.
The first column is the partition key (pk), which is equal to "post-100", and the second column is the sort key (sk), which is also equal to "post-100".
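As a minimal sketch, assuming boto3 (the Python AWS SDK) and an illustrative table named "social-app" whose key attributes are "pk" and "sk", writing such a post item could look like this:

import boto3

# Illustrative setup - assumes a DynamoDB table named "social-app"
# whose partition key is "pk" and whose sort key is "sk".
dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("social-app")

# The post item: pk and sk are both the postId so the post can be
# fetched directly by its full primary key.
table.put_item(
    Item={
        "pk": "post-100",
        "sk": "post-100",
        "entityType": "post",
        "title": "My first post",
        "content": "Hello single table design!",
        "postDate": "2024-01-15",
    }
)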
To get this particular post we can run the pseudocode query:
pk = "post-100" and sk = "post-100"
This satisfies our first access pattern.
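In boto3 that query could look like the following sketch (same assumed "social-app" table as above):

import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("social-app")

# Fetch a single post directly by its full primary key (pk + sk).
response = table.get_item(Key={"pk": "post-100", "sk": "post-100"})
post = response.get("Item")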
The next design will satisfy the next 3 access patterns:
Let's explain the screenshot above.
First, each item shown here uses a postId as its partition key (pk) - essentially the post to which it belongs. This denotes a one-to-many relationship.
By providing the pk = "post-100", you can query for all related items of that post item. This means all comments, likes, and perhaps nested comments and nested comment likes if they exist.
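As a sketch, that broader query could look like this in boto3 (again using the assumed table and key names from above):

import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("social-app")

# One query on the partition key alone returns the post plus every
# comment and like stored under it.
response = table.query(KeyConditionExpression=Key("pk").eq("post-100"))
related_items = response["Items"]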
The second and third access patterns we identified above can be satisfied with the following pseudocode query:
pk = "post-100" and sk begins_with("comment")
This will fetch all items whose partition key "pk" equals "post-100" and whose sort key "sk" begins with the string "comment".
We can apply the same logic to get the 3rd access pattern for likes of a post by replacing the "comment" value with "like".
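A boto3 sketch covering both of these access patterns, with a hypothetical helper named query_children, could look like this:

import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("social-app")

def query_children(post_id, prefix):
    # Fetch all items under a post whose sort key begins with the given prefix.
    response = table.query(
        KeyConditionExpression=Key("pk").eq(post_id) & Key("sk").begins_with(prefix)
    )
    return response["Items"]

comments = query_children("post-100", "comment")  # access pattern 2
likes = query_children("post-100", "like")        # access pattern 3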
The 4th and last access pattern can be satisfied by following a similar logic.
pk = "post-100" and sk begins_with(comment-200#like)
This query is similar to the one above, only here we are providing a specific commentId for our sort key ("comment-200").
Using this pattern we can further optimize our query to fetch nested items such as like items belonging to comment items.
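As a boto3 sketch, still under the same assumptions as above, that nested query could look like this:

import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("social-app")

# Likes of one specific comment: the sort key prefix narrows the query
# to the like items stored under "comment-200".
comment_likes = table.query(
    KeyConditionExpression=Key("pk").eq("post-100")
    & Key("sk").begins_with("comment-200#like")
)["Items"]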
Benefits Of The Single Table Design
This brings us to the first of the two benefits of the single table design mentioned earlier: reducing latency.
We know that much of the power of relational databases comes from performing joins across tables to fetch related data.
In NoSQL, of course, we do not have joins. This is by design.
Joins are slow and expensive in terms of CPU.
NoSQL databases are designed to circumvent this issue and perform cheaper operations by avoiding joins altogether.
The question now is: how can we fetch related data without joins, keep latency low, and remain performant at large scale?
The answer is displayed in the above screenshot.
With this single table design we:
a) Pre-join our related data together
b) Denormalize our data to satisfy our access patterns
With other "classical" noSQL databases, fetching multiple records that reside in two different tables/collections would require no less than two operations that must be performed synchronously, causing us to suffer the added latency.
For example, to fetch all comments from a given post requires us to first fetch the given post by its postId, then wait for the response of that post item, use its postId to query for all comments by this given postId and finally we retrieve the desired comments.
This of course adds latency as we wait for the two synchronous operations. It adds more costs as well since we know have to query multiple tables.
With the single table design, we eliminate both issues.
By querying a post by its postId, we retrieve both the post and its comments in one operation - no extra round trip, and only a single table's cost at all times.
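To make the contrast concrete, here is a sketch of both approaches in boto3. The separate "posts" and "comments" tables and their key names are hypothetical; the single-table query is the same one shown earlier:

import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource("dynamodb")

# Classical multi-table approach: two sequential round trips.
posts_table = dynamodb.Table("posts")          # hypothetical posts table
comments_table = dynamodb.Table("comments")    # hypothetical comments table
post = posts_table.get_item(Key={"postId": "post-100"}).get("Item")
comments = comments_table.query(
    KeyConditionExpression=Key("postId").eq("post-100")
)["Items"]

# Single table design: one round trip returns the post and its comments together.
single_table = dynamodb.Table("social-app")
post_and_comments = single_table.query(
    KeyConditionExpression=Key("pk").eq("post-100")
)["Items"]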
That is incredibly powerful and efficient!
To reiterate: data that is accessed together should be stored together.
Now we can truly understand this idea from Rick Houlihan.
How much can we really save on costs?
As your user base scales, the number of tables will inevitably rise, and with them, the number of indexes.
In DynamoDB, you are charged per table as well as per index on each table.
As in any good solution architecture, the first and often most important consideration is cost.
An architecture that accommodates 15 tables (often required) with an average of 3 global secondary indexes (GSIs) per table translates to 60 individual billable resources (15 tables plus 45 indexes), which at scale may not be cheap.
If we provision these 60 resources (tables and indexes) at a modest cost of $10 per resource, we can expect to pay $600 every month for our database.
On the other hand, with one properly designed "single" table, we can add 5 indexes to satisfy additional access patterns, translating to a total of 6 resources.
By using one table only, we cut the bill from 60 billable resources to 6 - roughly $60 instead of $600 per month at the same rate, a 90% reduction.
Final Thoughts
Following these enormous benefits, should we always use the single table design pattern for all our NoSQL needs?
Definitely not.
The single table design is typically recommended when you have the following two needs:
1. You need low latency when fetching related data together, even at a very large scale.
2. You need to keep database costs down by avoiding an ever-growing number of tables and indexes.
Additionally, following the best practices of the single table design will further drive down costs and improve efficiency.
Conclusion
The single table design is a database design pattern capable of achieving amazing scalability, low data latency, and measurable cost savings. It replaces multiple tables with a single table, lowering latency when fetching related data together and reducing costs because we maintain and pay for only one table for our entire application.
By pre-joining related data and eliminating slow joins, we ensure that data that is accessed together is stored together.
I hope you learned something valuable and practical in this edition of Excelling In DynamoDB.
You can say hi, collaborate with me, and explore my social media links here:
Uriel Bitton