Seesaw Engineering | DynamoDB as the Foundation of Seesaw
Reflections on our big bet on DynamoDB
When Seesaw started nearly eight years ago, we searched for a database that would easily scale as we grew, require minimal sharding and schema migrations, and be both reliable and affordable.
With this laundry list of needs, we found our answer in DynamoDB. DynamoDB is a managed, serverless, key-value database that has no upper limit on overall table size or read/write rates (with a few caveats). We’ve leveraged it as our source-of-truth primary datastore for every Seesaw service we’ve built, and it has performed remarkably well in this capacity over the years.
In this post, we’ll discuss the benefits of DynamoDB and the various ways we utilize it to power Seesaw. In subsequent posts, we’ll share more complex examples of how we use this technology.
DynamoDB Overview
Before we get into the nitty-gritty details about how Seesaw uses Dynamo, we should first explain how Dynamo stores data and supports data access.
Anatomy of a DynamoDB Row
Every DynamoDB row has a unique identifier called a Primary Key. This Primary Key is the mechanism by which we can retrieve a single row out of our DynamoDB table.
The Primary Key is split into two parts. The first part, the Partition Key, is a required field that Dynamo uses to group data together on a single partition and can be used as a standalone primary key.
The second part is the optional Sort Key. Including this value enables efficient sorting of all data that is stored with the same Partition Key. The combined Partition and Sort Key must be a unique value.
Once you have your Primary Key sorted out, you can populate each row with additional data stored simply as keys and values. These key-value pairs are extremely flexible as to what data you can store: Dynamo supports a variety of data types out of the box including strings, lists, and JSON.
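To make that concrete, here’s a minimal sketch of what a single row might look like. All of the field names here are illustrative assumptions, not Seesaw’s actual schema:

```python
# A hypothetical Seesaw class row, shown as the kind of item you'd pass to
# DynamoDB's PutItem API. Field names are illustrative only.
class_row = {
    "class_id": "class-1234",             # Partition Key (string)
    "name": "Ms. Rivera's 3rd Grade",     # string attribute
    "student_uuids": ["s-001", "s-002"],  # list attribute
    "settings": {                         # map (JSON-like) attribute
        "journal_enabled": True,
        "approval_required": False,
    },
}
```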
Access Patterns
Once we understand how data is grouped and stored in Dynamo, we can explore how we can pull that data out. There are two interesting access patterns (the third, Scan, is not recommended outside of very specific circumstances at any significant scale, so we’ll leave it out for brevity).
Get
This one is the most straightforward. We specify the full Primary Key (which could be just a Partition Key, or a combined Partition and Sort Key) and Dynamo retrieves the associated row for us. We can also execute these in parallel by leveraging the BatchGet API.
Example: Let’s say we have a simple table that represents Seesaw classes, where the Primary Key is simply a Partition Key that is a unique identifier. We can use a Get API call to fetch the full row corresponding to the class by specifying its unique ID.
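A sketch of what that looks like using boto3’s parameter shapes for Get and BatchGet. The table and attribute names are assumptions, and we only build the request dicts here rather than hitting AWS:

```python
# Request parameters for a Get and a BatchGet against a hypothetical
# "seesaw_class" table whose Primary Key is just the Partition Key
# "class_id". With boto3, these would be passed to
# table.get_item(**get_params) and dynamodb.batch_get_item(**batch_params).
def build_get_params(class_id: str) -> dict:
    return {"Key": {"class_id": class_id}}

def build_batch_get_params(table_name: str, class_ids: list) -> dict:
    return {
        "RequestItems": {
            table_name: {"Keys": [{"class_id": cid} for cid in class_ids]}
        }
    }

get_params = build_get_params("class-1234")
batch_params = build_batch_get_params("seesaw_class", ["class-1", "class-2"])
```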
Query
This is where things get interesting. If we model our data to have a Partition Key and Sort Key, we unlock a powerful DynamoDB feature: Queries. We can now use a Query API call to fetch all of the rows associated with a single Partition Key, which can optionally be returned sorted or filtered based on the Sort Key.
Example: Let’s now say we have a table that contains Posts made by the students in a Seesaw Class. One way to store this data is with the Partition Key set to the Class’s unique ID, and a Sort Key that concatenates the post’s upload timestamp with a unique Post ID.
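One way to build such a Sort Key is sketched below; the `#` delimiter and ISO-8601 timestamp format are assumptions:

```python
from datetime import datetime, timezone

# Concatenate an ISO-8601 upload time with a unique Post ID. Because the
# timestamp comes first, rows under the same Partition Key (the Class ID)
# sort chronologically; the Post ID suffix keeps the combined key unique.
def make_post_sort_key(uploaded_at: datetime, post_id: str) -> str:
    ts = uploaded_at.astimezone(timezone.utc).isoformat()
    return f"{ts}#{post_id}"

sk = make_post_sort_key(
    datetime(2021, 3, 1, 9, 30, tzinfo=timezone.utc), "post-42"
)
# sk → "2021-03-01T09:30:00+00:00#post-42"
```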
Now we can execute a Query to access this data: give me all the Posts in the specified Class. Or we can do: give me all the Posts in the specified Class between a given Monday and Friday. Not exactly rocket science for a database, but when you consider that these are millisecond operations that cost fractions of a cent to execute with no concerns about scaling, it becomes pretty impressive.
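The two Queries above can be sketched as low-level DynamoDB Query parameters. The table and attribute names are assumptions; with boto3 you’d pass these dicts to `client.query(**params)`:

```python
# "Give me all the Posts in the specified Class."
def posts_in_class(class_id: str) -> dict:
    return {
        "TableName": "posts",
        "KeyConditionExpression": "class_id = :cid",
        "ExpressionAttributeValues": {":cid": {"S": class_id}},
    }

# "Give me all the Posts in the specified Class between Monday and Friday."
# BETWEEN works on the Sort Key because its timestamp prefix sorts
# lexicographically in chronological order.
def posts_in_class_between(class_id: str, start_sk: str, end_sk: str) -> dict:
    return {
        "TableName": "posts",
        "KeyConditionExpression": "class_id = :cid AND sk BETWEEN :lo AND :hi",
        "ExpressionAttributeValues": {
            ":cid": {"S": class_id},
            ":lo": {"S": start_sk},
            ":hi": {"S": end_sk},
        },
    }

week = posts_in_class_between("class-1234", "2021-03-01", "2021-03-05T23:59:59")
```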
Classic Key-Value Store
At Seesaw, our primary API backend service leverages a more traditional model-to-table approach. In other words, the data model representing a Seesaw Class has an associated DynamoDB seesaw_class table. There’s no need for a Sort Key in this approach, since the data access pattern we are optimizing for is Get (and BatchGet) operations.
Since most operations in Seesaw happen within the context of a Seesaw Class, it makes a lot of sense to optimize for this approach: loading the data that powers the UI becomes a series of Dynamo Get or BatchGet calls.
This model works well when there is a reasonable upper limit on the number of related rows that we need to fetch. For example, there’s an upper bound on the number of students that can be in a class, so it’s fine to denormalize that data and store the student UUIDs on the Class object.
However, it starts to break down when there is no upper bound on the amount of data to return. An example of this scenario would be fetching the Posts in a Class. There could be thousands of Posts, so it doesn’t make sense to denormalize them onto the Class.
There are two options to mitigate this, and we’ve gone with a hybrid approach. One is to replicate our DynamoDB data into a secondary data store that allows for more flexible queries; we’ve chosen Elasticsearch for that role. That decision has come with some tradeoffs, and deserves a blog post in and of itself!
The other option is to use Dynamo’s Global Secondary Indexes (GSIs). A GSI is a managed secondary table that can use a different Partition and Sort Key than the main table, and is seamlessly kept up to date by Dynamo under the hood. It’s important to keep in mind that you pay roughly 2x for every write, since each write now goes to both the main table and the GSI.
Using this Post case as an example, we could build a GSI on our Post table where the Partition Key is the Author ID, and the Sort Key is the Post’s creation epoch timestamp. Now we can easily Query for all the Posts from a given Author and page through the results. Even if we didn’t have this access pattern in mind when the table was created, we always have the opportunity to add this GSI later.
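A sketch of a Query against such a GSI (the index and attribute names here are assumptions): the only difference from a base-table Query is the IndexName parameter:

```python
# Query the hypothetical author GSI: Partition Key author_id, Sort Key
# created_at (epoch seconds). Results come back newest-first and can be
# paged via the LastEvaluatedKey DynamoDB returns with each response.
def posts_by_author(author_id: str, page_size: int = 25) -> dict:
    return {
        "TableName": "posts",
        "IndexName": "author_id-created_at-index",  # hypothetical GSI name
        "KeyConditionExpression": "author_id = :aid",
        "ExpressionAttributeValues": {":aid": {"S": author_id}},
        "ScanIndexForward": False,  # descending by Sort Key (newest first)
        "Limit": page_size,
    }

params = posts_by_author("author-99")
```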
Query-Optimized Single Table Design
As the team and Seesaw platform have grown, we’ve continued to find ways to better leverage DynamoDB in a cost-efficient manner. We’ve found that there are some operational benefits of building out a single table instead of having one table per data model. There are trade-offs here for sure, but there are some nice wins around less spiky access patterns and reduced overhead adjusting our Dynamo autoscaling targets (which automatically adjust table capacity based on usage but can quickly get expensive).
Modeling data in a single table requires only small changes. Instead of using, for example, a Class ID as the Partition Key, we use generic columns like _pk (partition key) and _sk (sort key). The ORM layer automatically handles reads and writes to these generic fields.
So we might model our Seesaw Post data with type-prefixed keys: the Partition Key identifies the Class and the Sort Key identifies the Post within it, each namespaced by a prefix. That model corresponds to a generic row in the single table.
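A sketch of that mapping, with hypothetical prefixes and field names:

```python
# Map a Post model onto generic single-table keys. The type prefixes
# ("CLASS#", "POST#") namespace each object type, so keys from different
# models can coexist in one table, and all Posts for a Class land under
# one _pk where they can be Queried and sorted by the _sk timestamp.
def post_to_row(class_id: str, created_at: str, post_id: str, body: str) -> dict:
    return {
        "_pk": f"CLASS#{class_id}",
        "_sk": f"POST#{created_at}#{post_id}",
        "body": body,
    }

row = post_to_row("class-1234", "2021-03-01T09:30:00Z", "post-42", "Look what I made!")
# row["_pk"] → "CLASS#class-1234"
# row["_sk"] → "POST#2021-03-01T09:30:00Z#post-42"
```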
Thanks to the prefix, we’re guaranteed to not have collisions with other object types, and we’re optimized for Queries. Nice!
Summary
As we can see, Seesaw leverages DynamoDB in a variety of ways. In future posts, we’ll continue to describe how we benefit from DynamoDB’s flexibility, such as by using Single Table Design to create our own Graph Model.
If these types of technical challenges interest you, come join us! As Seesaw continues to help teachers, students, and families in these difficult times, we’ll be encountering new scaling challenges to support their educational experiences.
Thanks to Alex DeBrie for lending his expertise to the Seesaw team. He has an excellent book about Dynamo that you can find here. We would also like to thank Zach Schneider for spearheading this post and Eugene Li, Lauren Block, Mila Petranovic, and Taek Yun for supporting.
Until next time!
Seesaw Engineering