NoSQL as a hot topic (without hot partitions, please)

These last few weeks, for me, it has all been about getting recertified as an AWS Developer, since my current certification was set to expire on April 30th, 2022. As I mentioned in one of my recent posts, I missed the exam: it looks like I booked 12:15 AM instead of 12:15 PM. That post has more of my rambling and complaints. Since I had already studied, I decided to go for the AWS Solutions Architect exam the following day (instead of the Developer exam that I lost and was hoping to get rescheduled). I passed. So now I am officially certified by AWS as both a Developer and an Architect.

Not that having an expired or non-expired cert matters in the least, but I do believe in certs, and as I've said in the past:

If you have one, it will boost your image and your confidence. If you don't, it does not mean anything.

Finally, I'd also recommend a quick read (or skim) of this article from Coursera, with perspectives on when you could or could not benefit from getting a certification.

https://www.coursera.org/articles/are-it-certifications-worth-it        

Daniel, but what about the title of this article?

Thanks for the reminder, dear reader. Yes, NoSQL.

My AWS exam was full of explicit (and also less direct) questions involving NoSQL. And I work for a cloud company that offers a nice NoSQL solution for huge clients. So I was super pumped to see this being such a big thing in my exam!

As you can see, over the last 5 years the popularity of the `NoSQL` keyword has stayed HIGH and reasonably stable.

Graph from https://trends.google.com/trends/explore?date=today%205-y&geo=CA&q=%2Fm%2F076tfwq

NoSQL is still very much relevant and still has a MASSIVE market to grow into. During the exam a couple of weeks ago (again, I'm so happy I passed) I noticed quite a few interesting things:

  1. S3 is for sure one of the most mentioned subjects of the exam. Honestly, it felt like every other question was about serving static websites from S3 and all the related benefits, issues, and architectures emerging from the n-tier apps that leverage this tool. This emphasis makes total sense, since I believe S3 was the first (or one of the very first) services AWS offered back in the day. And come on: who does not enjoy storing some files?
  2. The second most common type of question in the Solutions Architect exam was related to networking. I am pretty sure the Developer exam I took had almost nothing about networks, but this one had TONS of questions about VPNs, gateways, and how to organize and secure your network infrastructure and your data. I am a seasoned professional (12 years and counting), so I know my way around most of these terms and concepts, and I managed to pass the exam despite all these (less relevant to me) questions.
  3. Then DynamoDB and Kinesis. This is where I want to focus our attention.

DynamoDB is a NoSQL database

I work for @DataStax, the leading provider of Apache Cassandra as a service; we call it AstraDB -- give it a try, it is free: https://astra.datastax.com/. Having said that, DynamoDB (another NoSQL solution promoted by Amazon) was the source of many of the questions in the exam. Let me try to break down a bit of what I got from it.

Rate limiting is real.

In DynamoDB's world, you usually provision the number of reads and writes you want, and that spawns a multitude of questions regarding the capacity units you requested, the payload size of your typical read/write operation, and the frequency of those operations. The specific math is not hard, but you should not overlook this topic if you want to take the exam.
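The arithmetic can be sketched in a few lines of Python, using the published sizing rules (1 RCU covers one strongly consistent read per second of an item up to 4 KB, or two eventually consistent reads; 1 WCU covers one write per second of an item up to 1 KB). The function names here are my own, not anything from an AWS SDK:

```python
import math

# Back-of-the-envelope DynamoDB provisioned-capacity math.
# Rules: item size is rounded UP to the next 4 KB chunk for reads
# and the next 1 KB chunk for writes; eventual consistency halves
# the read cost.

def read_capacity_units(item_kb: float, reads_per_sec: int, strong: bool = False) -> int:
    units_per_read = math.ceil(item_kb / 4)       # 4 KB chunks, rounded up
    rcu = units_per_read * reads_per_sec
    return rcu if strong else math.ceil(rcu / 2)  # eventual consistency: half price

def write_capacity_units(item_kb: float, writes_per_sec: int) -> int:
    return math.ceil(item_kb / 1) * writes_per_sec  # 1 KB chunks, rounded up

# Classic exam-style example: 10 eventually consistent reads/sec of 6 KB items
# -> ceil(6/4) = 2 units per read -> 20 -> halved -> 10 RCU.
print(read_capacity_units(6, 10))        # 10
print(read_capacity_units(6, 10, True))  # 20 (strongly consistent)
print(write_capacity_units(1.5, 12))     # ceil(1.5) = 2 per write -> 24 WCU
```

The only trap in the exam questions is remembering to round the item size up before multiplying, and to halve only the eventually consistent reads.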

You can also have a database that you do not pre-provision. As far as I can tell, that is not very common among the DynamoDB audience (on-demand mode is quite new, and folks seem accustomed to knowing how much capacity they hired and what their bill is supposed to be). But the fact is that even on-demand databases still have internal guardrails on maximum throughput.

Why would you need to rate-limit database operations?

I don't know. I have suspicions. I will explore this further and post about it. Please let me know in the comments if you have snippets to contribute here.

Let's talk about the "Eventual consistency" (vs Strong) concept

Due to the nature of distributed databases, it is generally accepted that the CAP theorem is king: you can have 2 of these properties in your DB of choice, but never all 3.

  1. Consistency
  2. Availability
  3. Partition tolerance (or the term I prefer, resiliency) -- IN DISTRIBUTED SYSTEMS, PARTITIONS CANNOT BE AVOIDED.

So between CP, AP, and CA, they chose AP. Everything in life is a tradeoff, and Consistency is the least important property for so many use cases (most of which I am sure I am not even aware of). Having said that, DynamoDB's architecture is more inclined toward Eventual consistency, although a Strong consistency mode is provided, which kind of shifts the DB into CP mode.

The strong consistency mode affects the availability of the DB, costs more throughput-wise, adds latency, and in my personal opinion, is IN SO MANY CASES not really needed.

1 RCU is equivalent to two eventually consistent reads per second of an item up to 4KB in size

Designing Partition Keys to Distribute Your Workload Evenly (AKA: "WE DO NOT LIKE HOT PARTITIONS!!!")

This might be news to you, but a Primary key is something that is essential to most databases; you must provide a proper way to retrieve a piece of information uniquely. In DynamoDB (as in Apache Cassandra and DataStax Astra) a primary key might be simple or composite.

  1. Simple - one table column only
  2. Composite - more than one table column

If a composite key is used, then the first column declared is called the Partition key and the others are called Sort keys (also known as Clustering keys). Each primary key attribute must be defined as type `string`, `number`, or `binary`. High-cardinality attributes are recommended for DynamoDB partition keys (no, US states are not considered high-cardinality -- UUIDs generally are).

DynamoDB uses the partition key's value as input to an internal hash function. The output from the hash function determines the partition (physical storage internal to DynamoDB) in which the item will be stored.
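A toy sketch of that idea (this is NOT DynamoDB's actual internal hash function, just an illustration of hashing a key value onto a fixed set of partitions):

```python
import hashlib

# Toy model: map a partition key value to one of N physical partitions
# via a hash function. High-cardinality keys (UUID-style) spread load;
# low-cardinality keys (a handful of US states) concentrate it.

NUM_PARTITIONS = 4

def partition_for(key: str) -> int:
    # md5 here is arbitrary; the point is that the mapping is
    # deterministic and roughly uniform over the key space.
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % NUM_PARTITIONS

for key in ["user#1f2a", "user#9c3b", "CA", "NY"]:
    print(key, "->", partition_for(key))
```

With only a few distinct key values, some partitions inevitably receive most of the traffic -- that is the hot partition the exam keeps warning you about.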

Careful design of the sort key lets you retrieve commonly needed groups of related items using range queries with operators such as begins_with, between, >, <, and so on.

All items with the same partition key value are stored together, in sorted order by sort key value.
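Because items within a partition are kept sorted, range queries reduce to cheap slices over an ordered list. Here is a small simulation (the "ORDER#date" sort keys are hypothetical, and this is plain Python, not a DynamoDB API):

```python
from bisect import bisect_left, bisect_right

# Items sharing one partition key, kept sorted by sort key, so
# begins_with / between become binary searches plus a slice.

sort_keys = sorted([
    "ORDER#2022-01-15",
    "ORDER#2022-02-03",
    "ORDER#2022-02-27",
    "ORDER#2022-03-10",
])

def begins_with(prefix: str) -> list:
    lo = bisect_left(sort_keys, prefix)
    hi = bisect_right(sort_keys, prefix + "\xff")  # just past the prefix range
    return sort_keys[lo:hi]

def between(low: str, high: str) -> list:
    return sort_keys[bisect_left(sort_keys, low):bisect_right(sort_keys, high)]

print(begins_with("ORDER#2022-02"))  # the two February orders
print(between("ORDER#2022-02-01", "ORDER#2022-03-31"))
```

This is exactly why a time-based sort key is such a common design: "all of February's orders" is one contiguous read instead of a scan.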

When it comes to DynamoDB partition key strategies, no single solution fits all use cases. You should evaluate various approaches based on your data ingestion and access pattern, then choose the most appropriate key with the least probability of hitting throttling issues.
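One well-known mitigation worth sketching is "write sharding": appending a bounded random suffix to a popular partition key so writes spread across several logical partitions. (The key format below is made up for illustration; the tradeoff is that reads must fan out over all suffixes and merge the results.)

```python
import random

# Write sharding sketch: a hot key such as a single popular date gets
# a random suffix from a fixed, small range, turning one hot partition
# into SHARDS cooler ones.

SHARDS = 8

def sharded_key(hot_key: str) -> str:
    return f"{hot_key}#{random.randrange(SHARDS)}"  # where to WRITE

def all_shards(hot_key: str) -> list:
    return [f"{hot_key}#{i}" for i in range(SHARDS)]  # where to READ from

print(sharded_key("2022-04-30"))
print(all_shards("2022-04-30"))
```

As the paragraph above says, there is no single right answer: you pick the sharding scheme (and the read fan-out cost) that matches your ingestion and access patterns.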

Apache Cassandra

Cassandra's database design is based on the requirement for fast reads and writes, so the better the schema design, the faster data is written and retrieved. As a reminder:

Queries are the result of selecting data from a table; schema is the definition of how data in the table is arranged.

Data modeling in Apache Cassandra uses a query-driven approach, in which specific queries are the key to organizing the data.

  1. MUCH LIKE DYNAMODB, the Consistency level can be set on a per-query basis.
  2. There are no special types of nodes.
  3. A coordinator node is a role for a specific request.
  4. A subsequent request can have another node as the coordinator.
  5. Again: nodes are all the same.
  6. Writes always go through the coordinator.
  7. The write appends to a commit log, an 'append-only' data structure (meaning it is super fast), then the same info goes into the MemTable (an in-memory representation).
  8. Cassandra is a write-optimized database.
  9. Memory is flushed to disk from time to time, producing an SSTable.
  10. Then a new MemTable is started.
  11. SSTables are immutable.
  12. So... a deletion is a new record, called a TOMBSTONE. It is a special marker.
  13. COMPACTION takes small SSTables and merges them into bigger ones (running as a background process).
  14. Backups become trivial because of that immutability; when records are merged, the latest timestamp always wins.
  15. With any read request at a consistency level lower than ALL, Cassandra has the chance to run a Read Repair, which syncs the values and reaches agreement across all the nodes.
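The write path in the list above can be sketched as a toy model -- this is a drastically simplified, single-node caricature I wrote for illustration, not how Cassandra is implemented:

```python
# Toy model of the Cassandra write path: append-only commit log,
# in-memory MemTable, flushes to immutable SSTables, tombstones for
# deletes, and reads resolved by "latest timestamp wins".

TOMBSTONE = object()  # special marker for deletions

class ToyNode:
    def __init__(self):
        self.commit_log = []  # append-only: fast sequential writes
        self.memtable = {}    # key -> (timestamp, value)
        self.sstables = []    # immutable key -> (timestamp, value) snapshots

    def write(self, key, value, ts):
        self.commit_log.append((ts, key, value))
        self.memtable[key] = (ts, value)

    def delete(self, key, ts):
        self.write(key, TOMBSTONE, ts)  # a delete is just a special record

    def flush(self):
        self.sstables.append(dict(self.memtable))  # becomes immutable
        self.memtable = {}                         # then a new MemTable

    def compact(self):
        merged = {}
        for table in self.sstables:
            for key, (ts, value) in table.items():
                if key not in merged or ts > merged[key][0]:
                    merged[key] = (ts, value)      # latest timestamp wins
        self.sstables = [merged]

    def read(self, key):
        candidates = [t[key] for t in self.sstables if key in t]
        if key in self.memtable:
            candidates.append(self.memtable[key])
        if not candidates:
            return None
        ts, value = max(candidates, key=lambda c: c[0])
        return None if value is TOMBSTONE else value

node = ToyNode()
node.write("user:1", "Daniel", ts=1)
node.flush()
node.write("user:1", "Dan", ts=2)
node.flush()
node.compact()
print(node.read("user:1"))   # "Dan" -- the newer timestamp wins
node.delete("user:1", ts=3)
print(node.read("user:1"))   # None -- the tombstone hides the old value
```

Note how nothing is ever updated in place: every operation (including the delete) is an append, which is precisely why Cassandra is write-optimized.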

Is there more?

Of course. I will share more in the future. Ask me if you want to hear more, and provide constructive feedback if there is anything I missed completely that you want to see edited and corrected in the next edition.

Have a good one!
