DynamoDB Definition & Data Modeling
Omar Ismail
Senior Software Engineer @ Digitinary | Java 8 Certified | Spring & Spring Boot | AWS | Microservices | RESTful APIs & Integrations | FinTech | Open Banking | Digital Payments and Transformation
Amazon DynamoDB
Amazon DynamoDB -- also known as Dynamo Database or DDB -- is a fully managed NoSQL database service provided by Amazon Web Services. DynamoDB is known for low latency and scalability.
According to AWS, DynamoDB makes it simple and cost-effective to store and retrieve any amount of data, as well as serve any level of request traffic. All data items are stored on solid-state drives, which provide high I/O performance and can more efficiently handle high-scale requests. An AWS user interacts with the service by using the AWS Management Console or a DynamoDB API.
DynamoDB uses a nonrelational NoSQL database model that supports both key-value and document data structures. A user stores data in DynamoDB tables, then interacts with it via GET and PUT queries, which are read and write operations, respectively. DynamoDB supports basic CRUD operations and conditional operations. Each DynamoDB query is executed against a primary key specified by the user, which uniquely identifies each item.
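As a minimal sketch of those read and write operations (the table name Music and its attributes are hypothetical), a PutItem and a GetItem request payload look like this:
//Write (PUT) an item
{
  "TableName": "Music",
  "Item": {
    "Artist": "No One You Know",
    "SongTitle": "Call Me Today"
  }
}
//Read (GET) the item back by its primary key
{
  "TableName": "Music",
  "Key": {
    "Artist": "No One You Know"
  }
}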
Scalability, availability and durability
DynamoDB enforces replication across three Availability Zones for high availability, durability and read consistency. A user can also opt for cross-region replication, which creates a backup copy of a DynamoDB table in one or more global geographic locations.
DynamoDB provides two consistency options when reading data: eventually consistent reads and strongly consistent reads. The former, which is the AWS default, maximizes throughput at the potential expense of a read not reflecting the latest write or update. The latter reflects all successful writes and updates.
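A strongly consistent read is requested per call by setting ConsistentRead. A hedged sketch, reusing the tournaments table that appears later in this article:
//Strongly consistent Query
{
  "TableName": "tournaments",
  "KeyConditionExpression": "partitionKey = :tournamentId",
  "ExpressionAttributeValues": {
    ":tournamentId": "983d39a3-bdd6-4b61-88d5-58595d555b81"
  },
  "ConsistentRead": true
}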
There are no DynamoDB limits on data storage per user, nor a maximum throughput per table.
Security
Amazon DynamoDB offers Fine-Grained Access Control (FGAC) for an administrator to protect data in a table. The admin or table owner can specify who can access which items or attributes in a table and what actions that person can perform. FGAC is based on the AWS Identity and Access Management service, which manages credentials and permissions. As with other AWS products, the cloud provider recommends a policy of least privilege when granting access to items and attributes.
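As an illustration only (the GameScores table, the account ID, and the aws:userid variable are assumptions for the sketch), an FGAC-style IAM policy can use the dynamodb:LeadingKeys condition key to restrict a caller to items whose partition key matches their own user ID:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["dynamodb:GetItem", "dynamodb:Query"],
      "Resource": "arn:aws:dynamodb:us-east-1:123456789012:table/GameScores",
      "Condition": {
        "ForAllValues:StringEquals": {
          "dynamodb:LeadingKeys": ["${aws:userid}"]
        }
      }
    }
  ]
}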
An admin can view usage metrics for DynamoDB with Amazon CloudWatch.
Benefits of DynamoDB:
Scalability and Performance
Using Amazon DynamoDB, developers can combine incremental scalability and high performance with the ease of cloud administration, reliability, and a simple table data model, and thus meet customer demand. DynamoDB can scale a table across a large number of servers in multiple Availability Zones to meet storage needs. In addition, there is no specific limit on the amount of data a table can store: any amount of data can be stored and retrieved, and DynamoDB spreads the data across more servers as the data stored in a table grows.
Cross-region Replication
Cross-region replication enables you to maintain identical copies, called replicas, of a DynamoDB master table in one or more AWS Regions. Once you enable cross-region replication for a table, identical copies of the table are created in the other Regions, and any change to the table is automatically propagated to all replicas.
TTL (Time to Live)
TTL is a feature that lets you set a specific timestamp on items so that expired data is deleted from your tables. Once the timestamp expires, the corresponding item is marked as expired and is subsequently deleted from the table. With this functionality, you don't need to track expired data and delete it manually. TTL can help you reduce storage usage and the cost of storing data that is no longer relevant.
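Enabling TTL is a one-time table configuration. A minimal sketch, assuming a sessions table and a number attribute named expiresAt that holds an epoch timestamp:
{
  "TableName": "sessions",
  "TimeToLiveSpecification": {
    "Enabled": true,
    "AttributeName": "expiresAt"
  }
}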
Fine-grained Access Control
Fine-Grained Access Control gives a DynamoDB table owner a high level of control over the data in the table. In particular, the table owner can specify who can access which data or attributes of the table and which actions that person can perform, such as read, write, or update. Fine-Grained Access Control is used in combination with AWS Identity and Access Management, which manages the security credentials and the related permissions.
Stream
Using DynamoDB Streams, developers can receive item-level data as it appeared before and after a change. DynamoDB Streams provides a time-ordered sequence of the data changes made in a table over the last 24 hours. You can access a stream with a simple API call and use it to keep other data stores updated with the latest changes or to take actions based on them.
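A stream record carries the item images around a change. A rough sketch of a record for an update (attribute names and values are hypothetical, and the stream must be configured with the NEW_AND_OLD_IMAGES view type to include both images):
{
  "eventName": "MODIFY",
  "dynamodb": {
    "Keys": { "partitionKey": { "S": "user-123" } },
    "OldImage": { "partitionKey": { "S": "user-123" }, "score": { "N": "10" } },
    "NewImage": { "partitionKey": { "S": "user-123" }, "score": { "N": "12" } }
  }
}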
The basic DynamoDB components are tables, items, and attributes: a table is a collection of items, and each item is a collection of attributes.
DynamoDB supports three categories of data types: scalar types (number, string, binary, Boolean, and null), document types (list and map), and set types (string set, number set, and binary set).
DynamoDB Data Modeling
As we all know, DynamoDB is a NoSQL database service offered by AWS that supports key-value and document data structures. Data modeling in DynamoDB is very different from that in traditional relational database systems. Because we are so familiar with relational modeling, most new users try to follow the same approach in DynamoDB and end up with a large AWS bill. To avoid situations like this, always follow the best practices and recommended methods for DynamoDB. If you follow the correct steps, you can achieve single-digit-millisecond latency at any scale of data for a low price. In this article, I will guide you through a five-step process that will help you model data for your application.
Why NoSQL?
Nowadays, storage is cheap and computational power is expensive. NoSQL leverages this fact and sacrifices some storage space to allow for computationally easier queries. Essentially, what this means is that when designing your NoSQL data model, you will need to always be thinking of ways to simplify your queries to your database. When used correctly, NoSQL can be a much more cost-effective solution than a relational database.
Why DynamoDB?
Amazon DynamoDB is a key-value and document database that is fully managed, multi-region, and autoscaling so that you don’t have to worry about the infrastructure or datacenter. DynamoDB also offers an “On-Demand Capacity” pricing model. This makes it very accessible for any size application to instantly get started without having to worry about provisioning the capacity or having to upgrade later on.
Understanding the Basics
Unlike relational databases such as MySQL, NoSQL requires you to constantly ask questions about how the data will be queried. Asking these questions leads you down the path of item organization and how to split items up in a way that is conducive to speedy queries. The first step is to create primary keys for your items, which are composed of a partition key and sort key.
Note: You can use just the partition key as the primary key, but for most cases, you will also want to leverage a sort key.
Partition Key
DynamoDB tables are split into partitions. DynamoDB uses the partition key as input to an internal hash function; the result of that function determines the partition in which the item will be stored.
Hot Partitions
It is important to ensure that your partition keys split your items so that your workload is distributed evenly amongst the partitions to avoid the “hot” partition problem.
For example, let's say your table is split into 3 partitions and that you have provisioned 3 RCUs (read capacity units) for your table. That means that each partition would have access to 1 RCU. If 1 partition is hit much more frequently than the other 2, you risk being throttled since you may consume all of that partition's 1 RCU; meanwhile, you are still paying for 3 RCUs.
You can find more about this in the AWS official docs: Designing Partition Keys to Distribute Your Workload Evenly.
Sort Key
All items with the same partition key are stored together and are ordered by the sort key. By following this pattern, you can very efficiently query for multiple items by using only the partition key.
5 Steps for Data Modeling
1. Draw an entity diagram
The first step is to create an entity diagram. This helps you identify the main entities of your application and map real-life entities to application entities.
Let's take a simple example: an entity diagram with three entities named Employee, Company, and Project, each with its relevant attributes.
2. Identify relationships between entities
Next, you need to identify the relationships between those entities. There can be one-to-one, one-to-many, and many-to-many relationships between entities.
In the above example, there are three main relationships: a Company has many Employees, a Company has many Projects, and Projects and Employees are related many-to-many (resolved by introducing a ProjectEmployee entity).
3. List all the access patterns for each entity
This is the most important step of the process and you need to take enough time to finalize all the access patterns for each entity.
In the above example, there are three main entities, and you need to find all the access patterns for each of them. For this demonstration, I will only consider the access patterns for the Company entity.
In the same way, you can find all the access patterns for the other main entities, as well as for the newly created ProjectEmployee entity.
4. Identify primary keys for each entity
In this step, you need to identify the primary keys (hash key and sort key) for your access patterns. If you identify the most suitable keys, you will be able to cover almost every access pattern, even without secondary indexes.
A single table will be used for the given example, and the primary key will be a combination of a hash key and a sort key.
The hash key, or partition key, is used to identify the partition in which the data should be stored. DynamoDB passes the value of the key through a hash function, and the result of that function decides the partition in which the data will be stored. The sort key is used to arrange items within a partition.
There are 2 main conditions that we need to keep in mind when deciding the hash key and sort key.
Let's take the CRUD operations for the company access pattern of the Company entity. We need to identify each company uniquely to perform CRUD operations.
PK=COM#c001, SK=#METADATA#c001
Since we have given exact values for the hash key and sort key, we can uniquely identify company c001. Similarly, queries for other access patterns can be written as follows:
//Find all projects of a company.
PK=COM#c001, SK begins_with(PRO#)
//Find all employees of a company
PK=COM#c001, SK begins_with(EMP#)
//Find all projects and employees of a company.
PK=COM#c001
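As a hedged sketch of how these patterns translate into real requests (the table name CompanyTable and the attribute names PK and SK are assumptions), the "find all projects of a company" pattern becomes a Query like this:
//Find all projects of a company
{
  "TableName": "CompanyTable",
  "KeyConditionExpression": "PK = :pk and begins_with(SK, :prefix)",
  "ExpressionAttributeValues": {
    ":pk": "COM#c001",
    ":prefix": "PRO#"
  }
}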
But we can't write a query to get a company by name using the defined primary keys. That access pattern is not satisfied by the identified primary keys.
5. Identify secondary indexes (if required)
Sometimes, the identified primary keys may not satisfy all the access patterns. In such cases, you need to think about suitable secondary indexes (local or global).
In the above example, we have the find a company by name access pattern, which was not satisfied by the primary keys. There are methods like inverted indexes, GSI overloading, and sparse indexes that can be used to design secondary indexes. In this case, we will use GSI overloading.
When we use secondary indexes, we can choose hash key and sort key attributes that differ from those of the primary key.
So, I have decided to keep the partition key the same and change the sort key to a different attribute named filterName. I will use this attribute as a prefix in queries like this:
//Find a company by name
PK=COM#c001, filterName=ORG#CompanyOne
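In request form, this becomes a Query against the overloaded index. A hedged sketch, assuming the GSI is named GSI1 and uses PK and filterName as its key attributes:
//Find a company by name using the GSI
{
  "TableName": "CompanyTable",
  "IndexName": "GSI1",
  "KeyConditionExpression": "PK = :pk and filterName = :name",
  "ExpressionAttributeValues": {
    ":pk": "COM#c001",
    ":name": "ORG#CompanyOne"
  }
}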
Another Example of Data Modeling
Let’s say that you are designing an application where you need to store information about sports tournaments. We could say that each tournament has teams, players, and matches. The tournament would also have some basic information like location, date, game, and prize.
A very common approach for modeling data in NoSQL is to think in terms of a hierarchy. So what goes on top of our hierarchy? Well, think of it this way: without a tournament, we wouldn't have teams, players, or matches. The tournament provides the context that connects all of the other items together. So, for each tournament, we want to group all of the items alongside each other so that we can efficiently retrieve all of the tournament data in one query.
We’ll need to partition each of our tournaments based on a unique but uniformly distributed identifier. For this, I would recommend using UUIDv4 to generate unique tournament ids. So let’s take a look at what this could look like in a DynamoDB Table. Our UUIDv4 tournament id acts as our partition key.
As you can see, we have 4 individual items all with the same partition key and sorted by the sort key. You’ll also notice that each of the items is either prefixed with a description or just a hardcoded value. I will further explain why we do this later on. Also, each of these items has its own unique set of attributes, and they can all be retrieved by performing one simple query call to DynamoDB.
{
"TableName": "tournaments",
"KeyConditionExpression": "partitionKey = :tournamentId",
"ExpressionAttributeValues": {
":tournamentId": "983d39a3-bdd6-4b61-88d5-58595d555b81"
}
}
What if you only want the teams for a given tournament id?
This is where the prefix team- comes in handy. Since we prefixed all of the team item sort keys with team-, we can use a special function in our KeyConditionExpression: begins_with. This query call will retrieve all of the teams for a given tournament id (partition key).
{
"TableName": "tournaments",
"KeyConditionExpression": "partitionKey = :tournamentId and begins_with(sortKey, :teamPrefix)",
"ExpressionAttributeValues": {
":teamPrefix": "team-",
":tournamentId": "983d39a3-bdd6-4b61-88d5-58595d555b81"
}
}
What if you only want the basic details?
We can just perform a DynamoDB get item call since we know both the partition key and sort key.
{
"TableName": "tournaments",
"Key": {
"partitionKey": "983d39a3-bdd6-4b61-88d5-58595d555b81",
"sortKey": "tournament-details"
}
}
The Ten Rules for Data Modeling with DynamoDB
Understand the basics of single-table design with DynamoDB
By far, the biggest mindset shift you will make as you move from a traditional relational database into DynamoDB is the acceptance of "single-table design". In a relational database, each different entity will receive its own table, specifically shaped to hold that entity's data. There will be columns for each of the entity's attributes with required values and constraints.
With DynamoDB, this isn't the case. You'll jam all of your entities--Customers, Orders, Inventory, etc.--into a single table. There are no required columns and attributes, save for the primary key which uniquely identifies each item.
Your table will hold all of these entity types side by side, distinguished by their key patterns. This may look like hieroglyphics if you're new to DynamoDB and NoSQL design, but don't avoid it.
Take some time to understand why you need single-table design in DynamoDB. Single-table design is about organizing your data in the shape needed by your access patterns so that you can use it quickly and efficiently.
You won't have the neat, tidy, spreadsheet-like data that you had with a relational database. In fact, your table will look "more like machine code than a simple spreadsheet". But you will have a database that will scale to massive levels without any performance degradation.
Know your access patterns before you start
"If you don't know where you are going, any road will get you there." (Lewis Carroll)
With single-table design, you design your table to handle your access patterns. This implies that you must know your access patterns before you design.
So many developers want to design their DynamoDB table before they know how they'll use it. But again, this is that relational database mindset creeping into your process. With a relational database, you design your tables first, then add the indexes and write the queries to get the data you need. With DynamoDB, you first ask how you want to access the data, then build the table to handle these patterns.
This requires thoughtful work upfront. It requires engaging with PMs and business analysts to fully understand your application. And while this seems like it slows you down, you'll be glad you've done the work when you don't have to think about scaling your database when your application grows.
Model first, code last
As developers, it's hard not to jump straight into the code. There's nothing quite like the dopamine hit of making something from nothing; from saying, "Yes, I built that!"
But you need to resist that impulse in DynamoDB. Once you've outlined your access patterns, then take the time to model your DynamoDB table. This should be done outside of your code. You can use pen & paper, Microsoft Excel, or the NoSQL Workbench for Amazon DynamoDB.
As you model, you should produce two artifacts: an entity chart and a list of your access patterns.
As an example, here's a completed entity chart for one of the examples from my book:
In this case, I just use a spreadsheet to list each entity type and the primary key pattern for each entity. I have additional pages for any secondary indexes in my table.
Once you complete these artifacts, then you can move into implementation. These artifacts will serve as great additions to your service documentation.
Get comfortable with denormalization
When learning relational data modeling, we heard all about normalization: don't repeat data in multiple places; first normal form, second normal form, etc. As you normalize your data, you can join multiple tables at query time to get your final answer.
Normalization was built for a world with very different assumptions. In the data centers of the 1980s, storage was at a premium and compute was relatively cheap. But the times have changed. Storage is cheap as can be, while compute is at a premium.
Relational patterns like joins and complex filters use up valuable compute resources. With DynamoDB, you optimize for the problems of today. That means conserving compute by eschewing joins. Rather, you denormalize your data, whether by duplicating data across multiple records or by storing related records directly on a parent record.
With denormalization, data integrity becomes more of an application concern. You'll need to consider when this duplicated data can change and how to update it if needed. But this denormalization will give you greater scale than is possible with other databases.
Ensure uniqueness with your primary keys
A common requirement in data modeling is that you have a property that is unique across your entire application. For example, you may not want two users to register with the same username, or you may want to prevent two orders with the same OrderId.
In DynamoDB, each record in a table is uniquely identified by the primary key for your table. You can use this primary key to ensure there is not an existing record with the same primary key. To do so, you would use a Condition Expression to prevent writing an item if an item with the same key already exists.
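A hedged sketch of such a conditional write (the users table and attribute names are assumptions): the PutItem request is rejected if an item with the same primary key already exists:
//Create the user only if the username is free
{
  "TableName": "users",
  "Item": {
    "username": "user123",
    "email": "user123@example.com"
  },
  "ConditionExpression": "attribute_not_exists(username)"
}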
One additional caveat: you can only assert uniqueness on a single attribute with your primary key. If you try to assert uniqueness across two attributes by building both into a primary key, you will only ensure that no other item exists with the same combination of the two attributes.
For example, imagine you require a username and an email address to create a user. In your application, you want to ensure no one else has the same username and that no other account has used the same email address. To handle this in DynamoDB, you would need to create two items in a transaction where each operation asserts that there is not an existing item with the same primary key.
If you use this pattern, your table will contain two items for each user: one keyed by the username and one keyed by the email address.
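A hedged sketch of that transaction (table, key, and attribute names are assumptions): each Put asserts that no item with the same primary key exists, so the whole transaction fails if either the username or the email address is already taken:
{
  "TransactItems": [
    {
      "Put": {
        "TableName": "users",
        "Item": { "pk": "USER#user123", "email": "user123@example.com" },
        "ConditionExpression": "attribute_not_exists(pk)"
      }
    },
    {
      "Put": {
        "TableName": "users",
        "Item": { "pk": "USEREMAIL#user123@example.com" },
        "ConditionExpression": "attribute_not_exists(pk)"
      }
    }
  ]
}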
Avoid hot keys
Like most NoSQL databases, DynamoDB partitions (or 'shards') your data by splitting it across multiple instances. Each instance holds only a subset of your data. This partitioning mechanism is what underlies the ability of NoSQL databases to scale further than SQL databases. If your data is all on one machine, you need to scale to larger and larger instance sizes with more RAM and CPU. You'll get decreasing returns on this scale, and eventually, you'll hit the limits of scaling on a single instance altogether.
To partition your data, DynamoDB uses the concept of a?partition key. The partition key is part of your primary key and is used to indicate which instance should store that particular piece of data.
Even with this partitioning strategy, you need to be sure to avoid hot keys. A hot key is a partition key in your DynamoDB table that receives significantly more traffic than other keys in your table. This can happen if your data is highly skewed, such as data that follows a Zipf distribution, or it can happen if you model your data incorrectly.
DynamoDB has done a ton of work to make hot keys less of an issue for you. This includes moving your total table capacity around to the keys that need it so it can better handle uneven distributions of your data.
The biggest concern you need to consider with hot keys is around partition limits. A single partition in DynamoDB cannot exceed 3,000 RCUs or 1,000 WCUs. Those are?per second?limits, so they go pretty high, but they are achievable if you have a high scale application.
Handle additional access patterns with secondary indexes
When data modeling with DynamoDB, your primary key is paramount. It will be used to enforce uniqueness, as discussed above. It's also used to filter and query your data.
But you may have multiple, conflicting access patterns on a particular item in your table. One example I often use is a table that contains the roles played by actors and actresses in different movies. You may have one access pattern to fetch the roles by actor name, and another access pattern to fetch the roles in a particular movie.
Secondary indexes allow you to handle these additional access patterns. When you create a secondary index on your table, DynamoDB will handle copying all your data from your main table to the secondary index in a redesigned shape. In our movie roles example above, our main table may use the actor or actress's name as the partition key, while the secondary index could use the movie name as the partition key. This allows for handling both of our access patterns without requiring us to maintain two copies of the data ourselves.
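A hedged sketch of the second access pattern (the roles table, the index name MovieIndex, and the movie attribute are assumptions): fetching all roles in a given movie via the secondary index:
//Fetch all roles in a movie
{
  "TableName": "roles",
  "IndexName": "MovieIndex",
  "KeyConditionExpression": "movie = :movie",
  "ExpressionAttributeValues": {
    ":movie": "Inception"
  }
}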
Build aggregates into your data model
Relationships between objects, whether one-to-many or many-to-many, are common in data modeling. You'll have one entity (the 'parent') that has a number of related entities. Examples include customers to orders (a single customer will make multiple orders over time) or companies to employees (a single company will have many employees).
Often, you'll want to display the total count of related entities when showing the parent item. But for some relationships, this count can be quite large. Think of the number of stargazers for the React repository on GitHub (over 146,000) or the number of retweets on a particularly famous selfie from the Oscars (over 3.1 million!).
When showing these counts, it's inefficient to count all the related records in your data each time to show the count. Rather, you should store these aggregates on the parent item as the related item is inserted.
There are two ways you can handle this. First, you can use DynamoDB Transactions to increment the count at the same time you create the related item. This is good to use when you have a large distribution of parent items and you want to ensure the related item doesn't already exist (e.g., that a given user hasn't starred this repo or retweeted this tweet before).
A second option is to use DynamoDB Streams. DynamoDB Streams allows you to turn table updates into an event stream, allowing for asynchronous processing of your table. If you have a small number of items you're updating, you might want to use DynamoDB Streams to batch your increments and reduce the total number of writes to your table.
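A hedged sketch of the transactional approach (table, key, and attribute names are assumptions): create the star record and increment the parent's counter in a single atomic request:
{
  "TransactItems": [
    {
      "Put": {
        "TableName": "repos",
        "Item": { "pk": "REPO#react#STAR#user123" },
        "ConditionExpression": "attribute_not_exists(pk)"
      }
    },
    {
      "Update": {
        "TableName": "repos",
        "Key": { "pk": "REPO#react" },
        "UpdateExpression": "ADD starCount :inc",
        "ExpressionAttributeValues": { ":inc": 1 }
      }
    }
  ]
}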
Use ISO-8601 format for timestamps
With DynamoDB, you can use multiple different attribute types, including strings, numbers, and maps.
One question I often get is around the best way to represent timestamps. Should you use an epoch timestamp, which is an integer representing the number of seconds passed since January 1, 1970, or should you use a human-readable string?
In most cases, I recommend using the ISO-8601 time format. This is a string-based representation of the time, such as 2020-04-06T20:18:29Z.
The benefits of the ISO-8601 format are two-fold. First, it is human-readable, which makes it easier to debug quickly in the AWS console. The ISO-8601 example above is much easier to parse than its corresponding epoch timestamp of 1586204309. Second, the ISO-8601 format is still sortable. If you're using a composite primary key, DynamoDB will sort all the items within a single partition in order of their UTF-8 bytes. The ISO-8601 format is designed to be sortable when moving from left to right, meaning you get readability without sacrificing sorting.
In the example below, we are storing sensor readings from an IoT device. The partition key is the SensorId, and the sort key is the ISO-8601 timestamp for the reading:
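A sample reading item might look like this (attribute names and values are assumptions for the sketch):
{
  "SensorId": "sensor-1234",
  "Timestamp": "2020-04-06T20:18:29Z",
  "Reading": 21.4
}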
Now with this in mind, there are two times you should avoid ISO-8601 timestamps in favor of epoch timestamps. The first is if you're using DynamoDB Time to Live (TTL) to automatically expire items from your table. DynamoDB requires your TTL attribute to be an epoch timestamp of type number in order for TTL to work.
Second, you should use epoch timestamps if you actually plan to do math on your timestamps. For example, imagine you have an attribute that tracks the time at which a user's account runs out. If you have a way in your application where a user can purchase more time, you may want to run an update operation to increment that attribute. If a user purchases another hour of playtime, you could increase the time by 3600 seconds. This would allow you to operate on the timestamp directly without reading it back first.
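A hedged sketch of that increment (table, key, and attribute names are assumptions), using an update expression so the timestamp never has to be read first:
//Add one hour (3600 seconds) of playtime
{
  "TableName": "accounts",
  "Key": { "pk": "USER#user123" },
  "UpdateExpression": "SET expiresAt = expiresAt + :oneHour",
  "ExpressionAttributeValues": { ":oneHour": 3600 }
}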
Use On-Demand pricing to start
My last tip is billing-related. DynamoDB offers two different billing modes for operations: provisioned and on-demand. With provisioned capacity, you state in advance the number of read capacity units and write capacity units that you want available for your table. If your table exceeds those limits, you can see throttled reads or writes on your table.
With on-demand pricing, you don't need to provision capacity upfront. You only pay for each request you make to DynamoDB. This means no capacity planning and no throttling (unless you scale extremely quickly!).
Thanks and credit to: https://medium.com/swlh & https://codeburst.io/ & https://docs.aws.amazon. & https://www.trek10.com/