NoSQL as a hot topic (without hot partitions, please)
These last few weeks have been all about getting recertified as an AWS Developer, since my current certification was set to expire on April 30th, 2022. As I mentioned in one of my recent posts, I missed the exam: it looks like I booked 12:15 AM instead of 12:15 PM. That post has more of my rambling and complaints. Since I had already studied, I decided to take the AWS Solutions Architect exam the following day (instead of the Developer exam that I lost and was hoping to get rescheduled). I passed. So I am now officially certified by AWS as both a Developer and an Architect.
Not that having an expired or current cert matters in the least, but as I've said in the past, I do believe in certs:
If you have one, it will boost your image and your confidence. If you don't, it does not mean anything.
Finally, I'd also recommend a quick read (or skim) of this article from Coursera, which offers perspectives on when you might or might not benefit from getting a certification.
https://www.coursera.org/articles/are-it-certifications-worth-it
Daniel, but what about the title of this article?
Thanks for the reminder, dear reader. Yes, NoSQL.
My AWS exam was full of explicit (and also less direct) questions involving NoSQL. And I work for a cloud company that offers a nice NoSQL solution for huge clients. So I was super pumped to see this being such a big thing in my exam!
As you can see, over the last 5 years the popularity of the `NoSQL` keyword has stayed HIGH and reasonably stable.
NoSQL is still very much relevant and still has a MASSIVE market to grow into. During the exam a couple of weeks ago (again, I'm so happy I passed!) I noticed so many interesting things:
DynamoDB is a NoSQL database
I work for @DataStax, the leading provider of Apache Cassandra as a service; we call it AstraDB -- give it a try, it's free: https://astra.datastax.com/. Having said that, DynamoDB (another NoSQL solution, promoted by Amazon) was the source of many of the questions in the exam. Let me try to break down a bit of what I got from it.
Rate limiting is real.
In DynamoDB's world, you usually provision the number of reads and writes you want, and that creates a multitude of questions involving the Capacity Units you asked for, the payload size of your typical read/write operation, and the frequency of both. The specific math is not hard, but you should not overlook this topic if you want to take the exam.
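As a sketch of that math, assuming the rules DynamoDB publishes (1 RCU covers one strongly consistent read per second of an item up to 4 KB; 1 WCU covers one write per second of an item up to 1 KB; function names here are mine):

```python
import math

def read_capacity_units(item_size_kb: float, reads_per_sec: int,
                        strongly_consistent: bool = True) -> int:
    """RCUs needed: one strongly consistent read/sec of an item up to 4 KB
    costs 1 RCU; an eventually consistent read costs half of that."""
    units_per_read = math.ceil(item_size_kb / 4)
    if not strongly_consistent:
        units_per_read = units_per_read / 2
    return math.ceil(units_per_read * reads_per_sec)

def write_capacity_units(item_size_kb: float, writes_per_sec: int) -> int:
    """WCUs needed: one write/sec of an item up to 1 KB costs 1 WCU."""
    return math.ceil(item_size_kb) * writes_per_sec

# A classic exam-style question: 80 strongly consistent reads/sec of 6 KB items.
print(read_capacity_units(6, 80))                              # 2 RCU/read -> 160
print(read_capacity_units(6, 80, strongly_consistent=False))   # half -> 80
print(write_capacity_units(3, 10))                             # 3 WCU/write -> 30
```

The pattern in the exam questions is always the same: round the item size up to the nearest 4 KB (reads) or 1 KB (writes), then multiply by the request rate.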
You can also have a DB that you do not pre-provision (on-demand mode). But as far as I can tell, that is not yet the common choice among the DynamoDB audience, since the mode is quite new and folks seem accustomed to knowing how much capacity they hired and what their bill is supposed to be. The fact is, though, that even on-demand databases still have internal guardrails on maximum throughput.
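The two billing modes show up as a single parameter difference when creating a table. A minimal sketch of the request shape, using `boto3`'s `create_table` parameters (the `orders` table and its attributes are hypothetical):

```python
# Sketch: the same hypothetical table defined with provisioned vs on-demand
# capacity. The kwargs mirror boto3's dynamodb.create_table() parameters.
def table_definition(on_demand: bool) -> dict:
    definition = {
        "TableName": "orders",  # hypothetical table name
        "AttributeDefinitions": [
            {"AttributeName": "order_id", "AttributeType": "S"},
        ],
        "KeySchema": [
            {"AttributeName": "order_id", "KeyType": "HASH"},
        ],
    }
    if on_demand:
        # Pay per request: no capacity planning up front, but AWS still
        # enforces per-table and per-partition throughput guardrails.
        definition["BillingMode"] = "PAY_PER_REQUEST"
    else:
        definition["BillingMode"] = "PROVISIONED"
        definition["ProvisionedThroughput"] = {
            "ReadCapacityUnits": 100,
            "WriteCapacityUnits": 50,
        }
    return definition

# To actually create the table, you would pass this to the real client:
# boto3.client("dynamodb").create_table(**table_definition(on_demand=True))
```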
Why would you need to rate-limit database operations?
I don't know. I have suspicions. I will explore this further and post about it. Please let me know in the comments if you have snippets to contribute here.
Let's talk about the "Eventual consistency" (vs Strong) concept
Due to the nature of distributed databases, it is generally accepted that the CAP theorem is king: of Consistency, Availability, and Partition tolerance, your DB of choice can have two, but never all three.
So between CP, AP, or CA, they chose AP. Everything in life is a tradeoff, and Consistency is the least important of the three for so many use cases (most of which I'm sure I'm not even aware of). That said, DynamoDB's architecture leans toward eventual consistency, although a strong consistency mode is provided, which kind of shifts the DB toward CP.
The strong consistency mode affects the availability of the DB, costs more throughput-wise, adds latency, and in my personal opinion, is IN SO MANY CASES not really needed.
1 RCU is equivalent to two eventually consistent reads per second of an item up to 4KB in size
Designing Partition Keys to Distribute Your Workload Evenly (AKA: "WE DO NOT LIKE HOT PARTITIONS!!!")
This might be news to you, but a primary key is essential to most databases: you must have a proper way to retrieve a piece of information uniquely. In DynamoDB (as in Apache Cassandra and DataStax Astra), a primary key may be simple or composite.
If a composite key is used, the first attribute declared is called the Partition key and the other is called the Sort key (known as a Clustering key in Cassandra, where there can be several). Each primary key attribute must be of type `string, number, or binary`. High-cardinality attributes are recommended for DynamoDB partition keys (no, US states are not considered high-cardinality -- UUIDs generally are).
DynamoDB uses the partition key's value as input to an internal hash function. The output from the hash function determines the partition (physical storage internal to DynamoDB) in which the item will be stored.
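DynamoDB's internal hash function is not public, but any uniform hash illustrates the mechanism, and why cardinality matters. A conceptual sketch (partition count and key sets are made up for illustration):

```python
import hashlib
import uuid
from collections import Counter

# Conceptual sketch only: map a partition key to one of N physical
# partitions via a hash, the way DynamoDB does internally.
def partition_for(partition_key: str, num_partitions: int = 8) -> int:
    digest = hashlib.md5(partition_key.encode()).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

# Low cardinality (US states): a few partitions take all the heat.
states = ["CA", "TX", "NY"] * 1000
hot = Counter(partition_for(s) for s in states)

# High cardinality (UUIDs): the load spreads across all partitions.
uuids = [str(uuid.uuid4()) for _ in range(3000)]
spread = Counter(partition_for(u) for u in uuids)

print(len(hot), "partitions used for states vs", len(spread), "for UUIDs")
```

With only three distinct state values, at most three partitions ever receive traffic, no matter how many requests you send; that is a hot partition waiting to happen.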
Careful design of the sort key lets you retrieve commonly needed groups of related items using range queries with operators such as begins_with, between, >, <, and so on.
All items with the same partition key value are stored together, in sorted order by sort key value.
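The sorted-within-a-partition layout is what makes those range conditions cheap. A conceptual sketch in plain Python (the `ORDER#<date>` sort-key convention and helper names are mine, mimicking the DynamoDB conditions, not calling them):

```python
from bisect import bisect_left, bisect_right

# Conceptual sketch: the items for ONE partition key, kept sorted by sort
# key, the way DynamoDB stores them physically.
partition = sorted([
    "ORDER#2022-01-15",
    "ORDER#2022-02-03",
    "ORDER#2022-02-20",
    "ORDER#2022-03-11",
])

def begins_with(prefix: str) -> list:
    """Mimics the begins_with sort-key condition: a contiguous slice."""
    lo = bisect_left(partition, prefix)
    hi = bisect_right(partition, prefix + "\uffff")
    return partition[lo:hi]

def between(low: str, high: str) -> list:
    """Mimics the BETWEEN sort-key condition (inclusive bounds)."""
    return partition[bisect_left(partition, low):bisect_right(partition, high)]

print(begins_with("ORDER#2022-02"))  # the two February orders
```

Because the data is already sorted, both conditions resolve to one contiguous slice of the partition, which is why these queries stay fast regardless of partition size.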
When it comes to DynamoDB partition key strategies, no single solution fits all use cases. You should evaluate various approaches based on your data ingestion and access pattern, then choose the most appropriate key with the least probability of hitting throttling issues.
Apache Cassandra
Cassandra's database design is based on the requirement for fast reads and writes, so the better the schema design, the faster data is written and retrieved. As a reminder:
Queries are the result of selecting data from a table; schema is the definition of how data in the table is arranged.
Data modeling in Apache Cassandra uses a query-driven approach, in which specific queries are the key to organizing the data.
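Query-driven modeling usually means one table per query, with the same data denormalized into each. A bare-bones sketch of the idea in plain Python (dicts standing in for Cassandra tables; names are hypothetical):

```python
# Conceptual sketch of query-first modeling: the same user row is written
# to two "tables" (plain dicts here), each keyed by what a specific query
# needs -- duplication in exchange for fast single-partition reads.
users_by_id = {}     # serves: SELECT ... FROM users_by_id WHERE user_id = ?
users_by_email = {}  # serves: SELECT ... FROM users_by_email WHERE email = ?

def insert_user(user_id: str, email: str, name: str) -> None:
    row = {"user_id": user_id, "email": email, "name": name}
    users_by_id[user_id] = row     # denormalized write #1
    users_by_email[email] = row    # denormalized write #2

insert_user("u-1", "ada@example.com", "Ada")
print(users_by_email["ada@example.com"]["name"])  # Ada
```

In a relational database you would normalize and join; in Cassandra you write the data as many times as you have queries, because reads are cheap only when they hit a single partition.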
Is there more?
Of course. I will share more in the future. Ask me if you want to hear more, and provide constructive feedback if there is anything I missed completely that you'd like to see edited and corrected in the next edition.
Have a good one!